Non-blocking memory management mechanism for supporting dynamic-sized data structures

ABSTRACT

Solutions to a value recycling problem that we define herein facilitate implementations of computer programs that may execute as multithreaded computations in multiprocessor computers, as well as implementations of related shared data structures. Some exploitations of the techniques described herein allow non-blocking, shared data structures to be implemented using standard dynamic allocation mechanisms (such as malloc and free). Indeed, we present several exemplary realizations of dynamic-sized, non-blocking shared data structures that are not prevented from future memory reclamation by thread failures and which depend (in some implementations) only on widely-available hardware support for synchronization. Some exploitations of the techniques described herein allow non-blocking, indeed even lock-free or wait-free, implementations of dynamic storage allocation for shared data structures. A class of general solutions to value recycling is described in the context of an illustration we call the Repeat Offender Problem (ROP), including illustrative Application Program Interfaces (APIs) defined in terms of the ROP terminology. Furthermore, specific solutions, implementations and algorithms, including a Pass-The-Buck (PTB) implementation, are described.

CROSS-REFERENCE TO RELATED APPLICATION(S)

[0001] This application claims benefit under 35 U.S.C. §119(e) of U.S. Provisional Application No. 60/347,773, filed Jan. 11, 2002 and U.S. Provisional Application No. 60/373,359, filed Apr. 17, 2002, each of which is incorporated herein by reference.

BACKGROUND

[0002] 1. Field of the Invention

[0003] The present invention relates generally to coordination amongst execution sequences in a multiprocessor computer, and more particularly, to structures and techniques for facilitating implementations of concurrent data structures and/or programs.

[0004] 2. Description of the Related Art

[0005] Management of dynamically allocated storage presents significant coordination challenges for multithreaded computations. One clear, but important, challenge is to avoid dereferencing pointers to storage that has been freed (typically by operation of another thread). Similarly, it is important to avoid modifying portions of a memory block that has been deallocated from a shared data structure (e.g., a node removed from a list by operation of another thread). These and other challenges are generally well recognized in the art.

[0006] A common coordination approach that addresses at least some of these challenges is to augment values in objects with version numbers or tags, and to access such values only through the use of Compare-And-Swap (CAS) instructions, such that if a CAS executes on an object after it has been deallocated, the value of the version number or tag will ensure that the CAS fails. See e.g., M. Michael & M. Scott, Nonblocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors, Journal of Parallel and Distributed Computing, 51(1):1-26, 1998. In such cases, the version number or tag is carried with the object through deallocation and reallocation, which is usually achieved through the use of explicit memory pools. Unfortunately, this approach has resulted in implementations that cannot free memory that is no longer required.

[0007] Valois proposed another approach, in which the memory allocator maintains reference counts for objects in order to determine when they can be freed. See J. Valois, Lock-free Linked Lists Using Compare-and-Swap, in Proceedings of the 14th Annual ACM Symposium on Principles of Distributed Computing, pages 214-22, 1995. Valois' approach allows the reference count of an object to be accessed even after the object has been released to the memory allocator. This behavior restricts what the memory allocator can do with released objects. For example, the released objects cannot be coalesced. Thus, the disadvantages of maintaining explicit memory pools are shared by Valois' approach. Furthermore, application designers sometimes need to switch between different memory allocation implementations for performance or other reasons. Valois' approach requires the memory allocator to support certain nonstandard functionality, and therefore precludes this possibility. Finally, the space overhead for per-object reference counts may be prohibitive. We have proposed another approach that does allow memory allocators to be interchanged, but depends on double compare-and-swap (DCAS), which is not widely supported. See e.g., commonly-owned, co-pending U.S. application Ser. No. 09/837,671, filed Apr. 18, 2001, entitled “Lock-Free Reference Counting,” and naming David L. Detlefs, Paul A. Martin, Mark S. Moir and Guy L. Steele Jr. as inventors.

[0008] Interestingly, the work that may come closest to meeting the goal of providing support for explicit non-blocking memory management that depends only on standard hardware and system support predates the work discussed above by almost a decade. Treiber proposed a technique called obligation passing. See R. Treiber, Systems Programming: Coping with Parallelism, Technical Report RJ5118, IBM Almaden Research Center, 1986. The instance of this technique for which Treiber presents specific details is in the implementation of a lock-free linked list supporting search, insert, and delete operations. This implementation allows freed nodes to be returned to the memory allocator through standard interfaces and without requiring special functionality of the memory allocator. However, it employs a “use counter” such that memory is reclaimed only by the “last” thread to access the linked list in any period. As a result, this implementation can be prevented from ever recovering any memory by a failed thread (which defeats one of the main purposes of using lock-free implementations). Another disadvantage of this implementation is that the obligation passing code is bundled together with the linked-list maintenance code (all of which is presented in assembler code). Because it is not clear what aspects of the linked-list code the obligation passing code depends on, it is difficult to apply this technique to other situations.

SUMMARY

[0009] It has been discovered that solutions to a value recycling problem that we define herein facilitate implementations of computer programs that may execute as multithreaded computations in multiprocessor computers, as well as implementations of related shared data structures. Some exploitations of the techniques described herein allow non-blocking, shared data structures to be implemented using standard dynamic allocation mechanisms (such as malloc and free). Indeed, we present several exemplary realizations of dynamic-sized, non-blocking shared data structures that are not prevented from future memory reclamation by thread failures and which depend (in some implementations) only on widely-available hardware support for synchronization. Some exploitations of the techniques described herein allow non-blocking, indeed even lock-free or wait-free, implementations of dynamic storage allocation for shared data structures. Shared data structures that may benefit from the described techniques may themselves exhibit non-blocking, lock-free or wait-free properties, though need not in all implementations or modes of use. In some exploitations, our work provides a way to manage dynamically allocated memory in a non-blocking manner without depending on garbage collection. For example, techniques described herein may be exploited to manage dynamically allocated memory in a non-blocking manner in or for the implementation of a garbage collector itself.

[0010] A variety of solutions to the proposed value recycling problem may be implemented. For example, specific solutions are illustrated in the context of particular data structures, e.g., a lock-free FIFO queue for which we demonstrate true dynamic sizing. A class of general solutions to value recycling is described in the context of an illustration we call the Repeat Offender Problem (ROP), including illustrative Application Program Interfaces (APIs) defined in terms of the ROP terminology. Furthermore, specific solutions, implementations and algorithms, including a Pass-The-Buck (PTB) implementation, are described.

[0011] In some implementations, posting and handoff facilities are defined and manipulated in a way that results in a wait-free memory management solution suitable for use in conjunction with multi-threaded computations and dynamic-sized data structures. Furthermore, the ability, in some implementations, to hand off deallocation qualification for certain pointers allows multithreaded algorithm designs in which individual threads may exit without leaking memory and without waiting for operations of another thread to complete. Because some implementations are wait-free, non-blocking properties of various data structure implementations are not diluted when our techniques are employed to allow dynamic sizing. Of course, while lock-free data structures and algorithms represent an important application of the developed techniques, they are not limited thereto. In short, based on the description herein, persons of ordinary skill in the art will appreciate a wide variety of value recycling solutions and applications to practical problems and/or design challenges presented by concurrent shared objects in particular, and multithreaded computations in general.

[0012] In some embodiments in accordance with the present invention, a method of managing dynamically-allocated memory includes maintaining a set of posting locations for pointers to respective blocks of the dynamically-allocated memory, where individual posting locations are associated with respective individual threads of a multithreaded computation. Responsive to a request to qualify for deallocation those blocks referenced by a set of pointers, we determine, for each of the posting locations, whether then-current contents encode any pointer of the set, and for each pointer of the set not so encoded, qualify the respective block for deallocation. In some variations, a pointer determined to be encoded by the then-current contents is handed off for deallocation qualification in response to another request therefor. In some variations, handles are used to manage exclusive thread ownership of posting locations. Synchronization constructs such as a Compare-and-Swap (CAS) operation, a Load-Linked/Store-Conditional (LL/SC) operation pair and/or transactional sequences (e.g., as mediated by hardware transactional memory) may be employed in various realizations. In general, storage qualified for deallocation may be freed or reused, depending on the exploitation.

BRIEF DESCRIPTION OF THE DRAWINGS

[0013] The present invention may be better understood, and its numerous objects, features, and advantages made apparent to those skilled in the art by referencing the accompanying drawings.

[0014] FIG. 1 depicts a shared memory multiprocessor configuration that serves as a useful illustrative environment for describing operation of some shared object implementations in accordance with the present invention.

[0015] FIG. 2 illustrates transitions for a value v in accordance with one Repeat Offender Problem (ROP) formulation of value recycling.

[0016] FIG. 3 presents an I/O automaton specifying an exemplary formulation of the Repeat Offender Problem (ROP). The I/O automaton specifies environment and ROP output actions as well as state variables and transitions, including pre-conditions and post-conditions for various actions.

[0017] FIG. 4 is a timing diagram that illustrates interesting cases for an exemplary Pass-The-Buck (PTB) implementation of value recycling.

[0018] The use of the same reference symbols in different drawings indicates similar or identical items.

DESCRIPTION OF THE PREFERRED EMBODIMENT(S)

[0019] A versatile mechanism has been developed for managing values shared amongst threads of a multithreaded computation. In some important exploitations, certain values so managed encode pointers to storage that is dynamically allocated, reused and/or freed in a computational system. Accordingly, techniques of the present invention provide a useful framework for supporting memory management in dynamic-sized data structures (i.e., those that can grow and shrink over time). Because some implementations of these techniques exhibit strong non-blocking properties (including, in some cases, wait-free properties), the techniques are particularly attractive for use in connection with non-blocking implementations of dynamic-sized data structures. Indeed, a variety of applications to lock-free data structures are described herein.

[0020] However, while persons of ordinary skill in the art will recognize that the described techniques may be exploited in connection with data structures and/or algorithms that are non-blocking, indeed even lock-free or wait-free, based on the description herein, persons of ordinary skill in the art will also recognize that the described techniques may be exploited in connection with data structures and/or algorithms that are not necessarily non-blocking or for which not all modes of operation or use are non-blocking. Accordingly, descriptions herein made in the context of lock-free data structures are merely illustrative and provide a useful descriptive context in which the broader significance of the inventive techniques may be better appreciated.

[0021] As a result, descriptions of lock-free data structure exploitations should not be taken as limiting. Indeed, descriptions of exploitations in which managed values encode pointers should not be taken as limiting. As before, techniques for management of values that encode pointers simply provide useful descriptive context in which the broader significance of the inventive techniques may be better appreciated. Persons of ordinary skill in the art will appreciate, based on the description herein, that the inventive techniques are more generally applicable. In some exploitations, values so managed may include non-pointer values. For example, techniques of the present invention may be employed in the avoidance of ABA hazards. In some cases, avoided ABA hazards may involve non-pointer values. In others, avoided ABA hazards may involve pointer values and/or lock-free data structures. For example, one exploitation described herein illustrates use of the inventive techniques for avoidance of ABA hazards without version numbering commonly employed in the art.

[0022] Therefore, in view of the above, and without limitation, certain illustrative exploitations of the inventive techniques are described with particular attention to dynamic-sizing of lock-free data structures. Such illustrative exploitations should be viewed only as useful descriptive context, as the invention is defined solely by the claims that follow.

[0023] Dynamic-Sized Lock-Free Data Structures

[0024] In general, lock-free data structures avoid many of the problems associated with the use of locking, including convoying, susceptibility to failures and delays, and, in real-time systems, priority inversion. A lock-free implementation of a data structure provides that after a finite number of steps of any operation on the data structure, some operation completes. For reference, a wait-free implementation of a data structure provides that after a finite number of steps of any operation on the data structure, that operation completes. Both lock-free implementations and wait-free implementations fall within the broader class of non-blocking implementations, though wait-freedom is clearly the stronger non-blocking property. Both of the preceding definitions (i.e., lock-freedom and wait-freedom) tend to preclude the use of locks to protect the data structure, because a thread can take an unbounded number of steps without completing an operation if some other thread is delayed or fails while holding a lock the first thread requires.
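The distinction between the two properties can be illustrated with a shared counter. The following is only a minimal sketch in C11; the names counter, increment_lock_free and increment_wait_free are hypothetical and do not appear elsewhere in this description. The second function is wait-free only on the assumption that the target hardware provides a native atomic fetch-and-add; on machines that synthesize it from a retry loop, it is merely lock-free.

```c
#include <assert.h>
#include <stdatomic.h>

/* Shared counter; the name is illustrative only. */
static atomic_int counter;

/* Lock-free: a CAS retry loop. Whenever a CAS fails, some other thread's
 * CAS has succeeded, so the system as a whole makes progress, but a given
 * thread may retry an unbounded number of times under contention. */
static int increment_lock_free(void) {
    int old = atomic_load(&counter);
    /* On failure the call reloads the current value into `old`. */
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1)) {
        /* retry with the refreshed value of `old` */
    }
    return old + 1;
}

/* Wait-free (assuming a native fetch-and-add instruction): every call
 * completes in a bounded number of its own steps, regardless of what
 * other threads are doing. */
static int increment_wait_free(void) {
    return atomic_fetch_add(&counter, 1) + 1;
}
```

Neither function takes a lock, so a thread that stalls or fails in the middle of a call cannot prevent other threads from completing their own increments.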

[0025] Lock-free data structures present design challenges that are well recognized in the art and that highlight advantages of the inventive techniques, although such techniques are more broadly applicable to data structures and/or algorithms that are not necessarily non-blocking or which may exhibit stronger or weaker non-blocking properties. Therefore, without loss of generality, we focus illustratively on lock-free data structures.

[0026] In general, the difficulty of designing lock-free data structures is reflected in numerous papers in the literature describing clever and subtle algorithms for implementing relatively mundane data structures such as stacks, queues, and linked lists. There are a variety of reasons that dynamic-sized data structures are challenging to implement in a lock-free manner. For example, before freeing an object that is part of a dynamic-sized data structure (e.g., a node of a linked list), it is important to ensure that no thread will subsequently modify the object. Otherwise, a thread might corrupt an object allocated later that happens to reuse some of the memory used by the first object. Furthermore, in some systems, even read-only accesses of freed objects can be problematic: the operating system may remove the page containing the object from the thread's address space, causing the subsequent access to crash the program because the address is no longer valid. In general, the use of locks makes it relatively easy to ensure that freed objects are not subsequently accessed, because we can prevent access by other threads to the data structure (or parts thereof) while removing objects from it. In contrast, without locks, multiple operations may access the data structure concurrently, and a thread cannot determine whether other threads are already committed to accessing the object that it wishes to free. This is the root of one problem that our work aims to address.

[0027] Our techniques build on a problem specification that we call the Repeat Offender Problem (ROP). In some variations, the problem may be specified more generally in terms of value recycling. Solutions to the problem can be used to design dynamic-sized, lock-free data structures that can free memory to the operating system without placing special restrictions on the memory allocation mechanism. Most previous dynamic-sized lock-free data structure implementations do not allow memory resources used by the data structure to be reclaimed and reused for other purposes. ROP solutions are useful in achieving truly dynamic-sized lock-free data structures that can continue to reclaim memory even in the face of thread failures. Our solution is implementable in most modern shared memory multiprocessors.

[0028] FIG. 1 depicts a shared memory multiprocessor configuration in which techniques of the present invention may be employed. In particular, FIG. 1 depicts a pair of processors 111 and 112 that access storage 140. Storage 140 includes a shared storage portion 130 and local storage portions 121 and 122, respectively accessible by execution threads executing on processors 111 and 112. In general, the multiprocessor configuration is illustrative of a wide variety of physical implementations, including implementations in which the illustrated shared and local storage portions correspond to one or more underlying physical structures (e.g., memory, register or other storage), which may be shared, distributed or partially shared and partially distributed.

[0029] Accordingly, the illustration of FIG. 1 is meant to exemplify an architectural view of a multiprocessor configuration from the perspective of execution threads, rather than any particular physical implementation. Indeed, in some realizations, data structures encoded in shared storage portion 130 (or portions thereof) and local storage (e.g., portion 121 and/or 122) may reside in or on the same physical structures. Similarly, shared storage portion 130 need not correspond to a single physical structure. Instead, shared storage portion 130 may correspond to a collection of sub-portions each associated with a processor, wherein the multiprocessor configuration provides communication mechanisms (e.g., message passing facilities, bus protocols, etc.) to architecturally present the collection of sub-portions as shared storage. Furthermore, local storage portions 121 and 122 may correspond to one or more underlying physical structures including addressable memory, register, stack or other storage that are architecturally presented as local to a corresponding processor. Persons of ordinary skill in the art will appreciate a wide variety of suitable physical implementations whereby an architectural abstraction of shared memory is provided. Realizations in accordance with the present invention may employ any such suitable physical implementation.

[0030] In view of the foregoing and without limitation on the range of underlying physical implementations of the shared memory abstraction, operations on a shared object may be better understood as follows. Memory location 131 contains a pointer A that references an object 132 in shared memory. One or more pointers such as pointer A is (are) employed in a typical multithreaded computation. Local storage 134 encodes a pointer p₁ that references object 132 in shared memory. Local storage 135 also encodes a pointer p₂ that references object 132 in shared memory. In this regard, FIG. 1 illustrates a state, *p₁==*A && *p₂==*A, consistent with successful completion of load-type operations that bring copies of pointer value A into local storage of two threads of a multithreaded computation.

[0031] As a general matter, FIG. 1 sets up a situation in which operations of either thread may use (e.g., dereference) their respective locally encoded value (e.g., pointer p₁ or p₂). Unfortunately, either thread may take action that recycles the value (e.g., qualifying, or freeing, object 132 for reuse) without coordination with the other thread. Absent coordination, operation of the multithreaded computation may be adversely affected. To see why this is so, and to explain our solution(s), we now turn to a simple data structure example.

[0032] Simple Example: A Lock-Free Stack

[0033] To illustrate the need for and use of techniques described herein, we consider a simple example: a lock-free integer stack implemented using the compare-and-swap (CAS) instruction. We first present a somewhat naive implementation, and then explain two problems with that implementation. We then show how to address these problems using value recycling techniques described herein. The following preliminaries apply to each of the algorithms presented.

[0034] Preliminaries

[0035] Our algorithms are presented in a C/C++-like pseudocode style, and should be self-explanatory. For convenience, we assume a shared-memory multiprocessor with sequentially consistent memory. We further assume that the multiprocessor supports a compare-and-swap (CAS) instruction that accepts three parameters: an address, an expected value, and a new value. The CAS instruction atomically compares the contents of the address to the expected value, and, if they are equal, stores the new value at the address and returns TRUE. If the comparison fails, no changes are made to memory, and the CAS instruction returns FALSE.
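For readers who wish to experiment, the three-parameter CAS described above can be approximated in standard C11. The wrapper below is a sketch under our own assumptions, not part of the pseudocode that follows: the choice of intptr_t as the value type is ours, and the default (sequentially consistent) memory ordering of the C11 call is relied on to match the sequential consistency assumed in the text.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* CAS(addr, expected, new_value): atomically compares *addr with
 * expected; if equal, stores new_value and returns true; otherwise
 * leaves *addr unchanged and returns false. */
static bool CAS(_Atomic(intptr_t) *addr, intptr_t expected, intptr_t new_value) {
    /* The C11 primitive writes the observed value back into its
     * `expected` argument on failure. That write lands in our local
     * copy and is discarded, which yields the simpler semantics
     * described in the text. */
    return atomic_compare_exchange_strong(addr, &expected, new_value);
}
```

A caller that needs the observed value after a failed CAS would use atomic_compare_exchange_strong directly rather than this wrapper.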

[0036] Suitable modifications for other programming styles, other memory consistency modes, other multiprocessor architectures and other synchronization facilities provided by other instruction sets and/or memory architectures/interfaces, are straightforward. Based on the specific examples presented herein, persons of ordinary skill in the art will appreciate a variety of such suitable modifications.

[0037] A Somewhat Naive Implementation and Its Pitfalls

[0038] A straightforward implementation approach for our lock-free integer stack is to represent the stack as a linked list of nodes, with a shared pointer (call it TOS) that points to the node at the top of the stack. In this approach, pushing a new value involves allocating a new node, initializing it with the value to be pushed and the current top of stack, and using CAS to atomically change TOS to point to the new node (retrying if the CAS fails due to concurrent operations succeeding). Popping is similarly simple: we use CAS to atomically change TOS to point to the second node in the list (again retrying if the CAS fails), and retrieve the popped value from the removed node. Unless we have GC to reclaim the removed node, we must explicitly free it to avoid a memory leak. Code for this (incorrect) approach follows:

    struct nodeT { int val; nodeT *next; }
    shared variable nodeT *TOS initially NULL;

    Push(int v) {
    1:    nodeT *oldtos, *newnode = malloc(sizeof(nodeT));
    2:    newnode->val = v;
    3:    do {
    4:      oldtos = *TOS;
    5:      newnode->next = oldtos;
    6:    } while (!CAS(TOS,oldtos,newnode));
    }

    int Pop( ) {
    7:    nodeT *oldtos, *newtos;
    8:    do {
    9:      oldtos = *TOS;
    10:     if (oldtos == NULL) return “empty”;
    11:     newtos = oldtos->next;
    12:   } while (!CAS(TOS,oldtos,newtos));
    13:   int val = oldtos->val;
    14:   free(oldtos);
    15:   return val;
    }

[0039] The first problem with the preceding stack implementation is that it allows a thread to access a freed node. To see why, observe that a thread p executing the Pop code at line 11 accesses the node it previously observed (at line 9) to be at the top of the stack. However, if another thread q executes the entire Pop operation between the times p executes lines 9 and 11, then it will free that node (line 14) and p will access a freed node.

[0040] The second problem is more subtle. This problem is widely known as the ABA problem, because it involves a variable changing from one value (A) to another (B), and subsequently back to the original value (A). The problem is that CAS does not distinguish between this situation and the one in which the variable does not change at all. The ABA problem manifests itself in the preceding stack implementation as follows. Suppose the stack currently contains nodes 1 and 2 (with node 1 being at the top of the stack). Further suppose that thread p, executing a Pop operation, reads a pointer to node 1 from TOS at line 9, and then reads a pointer from node 1 to node 2 at line 11, and prepares to use CAS to atomically change TOS from pointing to node 1 to pointing to node 2. Now, let us suspend thread p for a moment. In the meantime, thread q executes the following sequence of operations: First q executes an entire Pop operation, removing and freeing node 1. Next, q executes another Pop operation, and similarly removes and deletes node 2. Now the stack is empty. Next q pushes a new value onto the stack, allocating node 3 for this purpose. Finally, q pushes yet another value onto the stack, and in this last operation, happens to allocate node 1 again (observe that node 1 was previously freed, so this is possible). Now, TOS points to node 1, which points to node 3. At this point, p resumes execution and executes its CAS, which succeeds in changing TOS from pointing to node 1 to pointing to node 2. This is incorrect, as node 2 has been freed (and may have subsequently been reallocated and reused for a different purpose). Further, note that node 3 has been lost from the stack. The root of the problem is that p's CAS did not detect that TOS had changed from pointing to node 1 and later changed so that it was again pointing to node 1. This is the dreaded ABA problem. Although values A and B encode pointers in the preceding example, the ABA problem may affect non-pointer values as well.
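The interleaving just described can be replayed deterministically in a single thread. The sketch below is purely illustrative and rests on several assumptions of ours: a static array pool stands in for the allocator (so that "freeing" and "reallocating" node 1 involve no real calls to free), the steps of threads p and q are written out in program order, and the names pool, cas_tos and replay_aba are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct node { int val; struct node *next; } node;

static node pool[4];          /* pool[1..3] stand in for nodes 1, 2 and 3 */
static _Atomic(node *) TOS;   /* shared top-of-stack pointer */

static bool cas_tos(node *expected, node *desired) {
    return atomic_compare_exchange_strong(&TOS, &expected, desired);
}

/* Replays the interleaving from the text; returns true if p's stale
 * CAS succeeds even though TOS changed several times in between. */
static bool replay_aba(void) {
    node *n1 = &pool[1], *n2 = &pool[2], *n3 = &pool[3];
    node *empty = NULL;

    n2->next = NULL;
    n1->next = n2;
    atomic_store(&TOS, n1);               /* stack: node 1 -> node 2 */

    /* Thread p: lines 9 and 11 of Pop, then p is suspended. */
    node *oldtos = atomic_load(&TOS);     /* node 1 */
    node *newtos = oldtos->next;          /* node 2 */

    /* Thread q: two complete Pops (nodes 1 and 2 are now "freed"). */
    atomic_store(&TOS, n2);
    atomic_store(&TOS, empty);
    /* Thread q: Push using node 3, then Push that recycles node 1. */
    n3->next = NULL;
    atomic_store(&TOS, n3);
    n1->next = n3;
    atomic_store(&TOS, n1);               /* stack: node 1 -> node 3 */

    /* Thread p resumes at line 12: the CAS sees node 1 and succeeds. */
    return cas_tos(oldtos, newtos);
}
```

After replay_aba returns, TOS refers to node 2, which q had already released, and node 3 is no longer reachable from TOS, matching the outcome described above.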

[0041] Our Mechanisms: PostGuard and Liberate

[0042] We provide mechanisms that allow us to efficiently overcome both of the problems described above without relying on GC. Proper use of these mechanisms allows programmers to prevent memory from being freed while it might be accessed by some thread. In this subsection, we describe how these mechanisms should be used, and illustrate such use for the stack example.

[0043] The basic idea is that before dereferencing a pointer, a thread guards the pointer, and before freeing memory, a thread checks whether a pointer to the memory is guarded. For these purposes, we provide two functions, PostGuard(ptr) and Liberate(ptr). PostGuard takes as an argument a pointer to be guarded. Liberate takes as an argument a pointer to be checked and returns a (possibly empty) set of pointers that it has determined are safe to free (at which time these pointers are said to be liberated). Thus, whenever a thread wants to free memory, instead of immediately invoking free, it passes the pointer to Liberate, and then invokes free on each pointer in the set returned by Liberate.
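To make the calling convention concrete, the following C11 sketch shows one simple way such a pair of functions could behave. It is not the Pass-The-Buck implementation and makes no claim to its wait-free or handoff properties; it is closer to a plain scan of the posting locations. Everything beyond the names PostGuard and Liberate is an assumption of ours: the explicit thread index tid, the fixed bounds MAX_THREADS and MAX_PENDING, the single guard per thread, and the per-thread pending lists that hold pointers found guarded so they can be rechecked on a later call.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS 4    /* hypothetical fixed bound on threads */
#define MAX_PENDING 16   /* hypothetical bound on deferred pointers */

static _Atomic(void *) guards[MAX_THREADS];      /* one posting location per thread */
static void *pending[MAX_THREADS][MAX_PENDING];  /* escaping, not yet liberated */
static size_t npending[MAX_THREADS];

/* Post thread tid's guard on ptr; posting NULL stands the guard down. */
static void PostGuard(int tid, void *ptr) {
    atomic_store(&guards[tid], ptr);
}

static int is_guarded(void *ptr) {
    for (int t = 0; t < MAX_THREADS; t++)
        if (atomic_load(&guards[t]) == ptr)
            return 1;
    return 0;
}

/* ptr (non-NULL) begins escaping. Pointers found unguarded are written
 * to out[] and their count is returned; the caller may free them.
 * Pointers found guarded stay on tid's pending list for a later call. */
static size_t Liberate(int tid, void *ptr, void *out[MAX_PENDING]) {
    assert(ptr != NULL && npending[tid] < MAX_PENDING);
    pending[tid][npending[tid]++] = ptr;

    size_t nout = 0, keep = 0;
    for (size_t i = 0; i < npending[tid]; i++) {
        void *q = pending[tid][i];
        if (is_guarded(q))
            pending[tid][keep++] = q;   /* defer: someone may dereference it */
        else
            out[nout++] = q;            /* liberated: safe to free */
    }
    npending[tid] = keep;
    return nout;
}
```

A caller would replace free(ptr) with a call to Liberate followed by free on each of the returned pointers. Note that in this simplified sketch a pointer guarded by a thread that never stands its guard down remains pending indefinitely.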

[0044] The most challenging aspect of using our mechanisms is that simply guarding a pointer is not sufficient to ensure that it is safe to dereference that pointer. The reason is that another thread might liberate and free a pointer after some thread has decided to guard that pointer, but before it actually does so. As explained in more detail below, Liberate never returns a pointer that is not safe to free, provided the programmer guarantees the following property:

[0045] At the moment that a pointer is passed to Liberate, any thread that might dereference this instance of the pointer has already guarded it, and will keep the pointer guarded until after any such dereferencing.

[0046] Note that a particular pointer can be repeatedly allocated and freed, resulting in multiple instances of that pointer. Thus, this property refers to threads that might dereference a pointer before that same pointer is subsequently allocated again.

[0047] If a thread posts a guard on a pointer, and subsequently determines that the pointer has not been passed to Liberate since it was last allocated, then we say that the guard traps the pointer until the guard is subsequently posted on another pointer, or removed from this pointer. We have found this terminology useful in talking about algorithms that use our mechanisms. It is easy to see that the programmer can provide the guarantee stated above by ensuring that the algorithm never dereferences a pointer that is not trapped.

[0048] We have found that the following simple and intuitive pattern is often useful for achieving the required guarantee. First, a thread passes a pointer to Liberate only after it has determined that the memory block to which it points is no longer in the shared data structure. Given this, whenever a thread reads a pointer from the data structure in order to dereference it, it posts a guard on that pointer, and then attempts to determine that the memory block is still in the data structure. If it is, then the pointer has not yet been passed to Liberate and so it is safe to dereference the pointer; if not, the thread retries. Determining whether a block is still in the data structure is sometimes as simple as rereading the pointer (for example, in the stack example presented next, we reread TOS to ensure that the pointer is the same as the one we guarded; see lines 9c and 9d in the exemplary code below).
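The read, guard, then revalidate pattern can be sketched in C11 as follows. The sketch assumes a single posting location named my_guard for the calling thread and a one-argument PostGuard that simply publishes into it; both are stand-ins of ours, as is the helper name read_and_trap_tos.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef struct node { int val; struct node *next; } node;

static _Atomic(node *) TOS;        /* shared pointer into the data structure */
static _Atomic(node *) my_guard;   /* stand-in for this thread's posting location */

/* Stand-in for PostGuard(ptr): publish ptr in the posting location. */
static void PostGuard(node *ptr) {
    atomic_store(&my_guard, ptr);
}

/* Reads TOS and returns only once the value read is trapped: the guard
 * was posted and TOS was then observed to still hold the same pointer,
 * so the node had not yet been removed (and hence not yet passed to
 * Liberate) at the moment the guard took effect. */
static node *read_and_trap_tos(void) {
    node *seen;
    do {
        seen = atomic_load(&TOS);           /* read the pointer */
        PostGuard(seen);                    /* guard it */
    } while (atomic_load(&TOS) != seen);    /* revalidate, else retry */
    return seen;                            /* safe to dereference if non-NULL */
}
```

The order of the three steps matters: posting the guard before the reread is what ensures that any thread that later passes the pointer to Liberate will find the guard already in place.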

[0049] Using Our Mechanisms to Fix the Naive Stack Algorithm

[0050] In the exemplary code that follows, we present stack code modified to make the required guarantee.

    struct nodeT { int val; nodeT *next; }
    shared variable nodeT *TOS initially NULL;

    Push(int v) {
    1:    nodeT *oldtos, *newnode = malloc(sizeof(nodeT));
    2:    newnode->val = v;
    3:    do {
    4:      oldtos = *TOS;
    5:      newnode->next = oldtos;
    6:    } while (!CAS(TOS,oldtos,newnode));
    }

    int Pop( ) {
    7:    nodeT *oldtos, *newtos;
    8:    do {
    9a:     do {
    9b:       oldtos = *TOS;
    9c:       PostGuard(oldtos);
    9d:     } while (*TOS != oldtos);
    10:     if (oldtos == NULL) return “empty”;
    11:     newtos = oldtos->next;
    12:   } while (!CAS(TOS,oldtos,newtos));
    13:   int val = oldtos->val;
    14a:  PostGuard(NULL);
    14b:  for (nodeT *n in Liberate(oldtos))
    14c:    free(n);
    15:   return val;
    }

[0051] To see how the modified code makes this guarantee, suppose that a thread p passes a pointer to node 1 to Liberate (line 14b) at time t. Prior to t, p changed TOS to a node other than node 1 or to NULL (line 12), and thereafter, until node 1 is liberated, freed and reallocated, TOS does not point to node 1. Suppose that after time t, another thread q dereferences (at line 11 or line 13) that instance of a pointer to node 1. When q last executes line 9d, at time t′, prior to dereferencing the pointer to node 1, q sees TOS pointing to node 1. Therefore, t′ must have preceded t. Prior to t′, q guarded its pointer to node 1 (line 9c), and keeps guarding that pointer until after it dereferences it, as required. Note that the pointer is guarded until q executes line 14a (which stands down the guard) or line 9c (which effectively reassigns the guard to another post).

[0052] Guarantees of PostGuard and Liberate

[0053] While the above descriptions are sufficient to allow a programmer to correctly apply our mechanisms to achieve dynamic-sized data structures, it may be useful to understand in more detail the guarantees that are provided by the Liberate function. Below we describe those guarantees, and argue that they are sufficient, when they are used properly as described above, to prevent freed pointers from being dereferenced.

[0054] We say that a pointer begins escaping when Liberate is invoked with that pointer. Every liberated pointer (that is, every pointer in the set returned by a Liberate invocation) is guaranteed to have the following properties:

[0055] It previously began escaping.

[0056] It has not been liberated (by any Liberate invocation) since it most recently began escaping.

[0057] It has not been guarded continuously by any thread since it most recently began escaping.

[0058] If pointers are only freed after they are returned by Liberate, the first two conditions guarantee that every instance of a pointer is freed at most once. They are sufficient for this purpose because threads only pass pointers to Liberate when they would have, in the straightforward (but defective) code, freed the pointers, and threads free only those pointers returned by Liberate invocations.

[0059] The last condition guarantees that a pointer is not liberated while it might still be dereferenced. To see that this last condition is sufficient, recall that the programmer must guarantee that any pointer passed to Liberate at time t will be dereferenced only by threads that already guarded the pointer at time t and will keep the pointer guarded continuously until after such dereferencing. The last condition prevents the pointer from being liberated while any such thread exists.

[0060] Representative Application Programming Interface (API)

[0061] In this section, we present an application programming interface (API) for the guarding and liberating mechanisms illustrated in the previous section. This API is more general than the one used in the previous section. In particular, it allows threads to guard multiple pointers simultaneously.

[0062] Our API uses an explicit notion of guards, which are posted on pointers. In this API, a thread invokes PostGuard with both the pointer to be guarded and the guard to post on the pointer. We represent a guard by an int. A thread can guard multiple pointers by posting a different guard on each pointer. A thread may hire or fire guards dynamically, according to the number of pointers it needs to guard simultaneously, using the HireGuard and FireGuard functions. We generalize Liberate to take a set of pointers as its argument, so that many pointers can be passed to Liberate in a single invocation. The signatures of all these functions are shown below.

    typedef int guard;
    typedef void *ptr_t;
    void PostGuard(guard g, ptr_t p);
    guard HireGuard();
    void FireGuard(guard g);
    set[ptr_t] Liberate(set[ptr_t] S);

[0063] Below we examine each function in more detail.

[0064] Detailed Function Descriptions

    void PostGuard(guard g, ptr_t p)
      Purpose: Posts a guard on a pointer.
      Parameters: The guard g and the pointer p; g must have been hired and not subsequently fired by the thread invoking this function.
      Return value: None.
      Remarks: If p is NULL then g is not posted on any pointer after this function returns. If p is not NULL, then g is posted on p from the time this function returns until the next invocation of PostGuard with the guard g.

    guard HireGuard()
      Purpose: “Acquire” a new guard.
      Parameters: None.
      Return value: A guard.
      Remarks: The guard returned is hired when it is returned. When a guard is hired, either it has not been hired before, or it has been fired since it was last hired.

    void FireGuard(guard g)
      Purpose: “Release” a guard.
      Parameters: The guard g to be fired; g must have been hired and not subsequently fired by the thread invoking this function.
      Return value: None.
      Remarks: g is fired when FireGuard(g) is invoked.

    set[ptr_t] Liberate(set[ptr_t] S)
      Purpose: Prepare pointers to be freed.
      Parameters: A set of pointers to liberate. Every pointer in this set must either never have begun escaping or must have been liberated since it most recently began escaping. That is, no pointer was in any set passed to a previous Liberate invocation since it was most recently in the set returned by some Liberate operation.
      Return value: A set of liberated pointers.
      Remarks: The pointers in the set S begin escaping when Liberate(S) is invoked. The pointers in the set returned are liberated when the function returns. Each liberated pointer must have been contained in the set passed to some invocation of Liberate, and not in the set returned by any Liberate operation after that invocation. Furthermore, Liberate guarantees for each pointer that it returns that no guard has been posted continuously on the pointer since it was most recently passed to some Liberate operation.
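To make the contract of these four functions concrete, the following is a toy, single-threaded model written in C. It is a sketch under stated assumptions only: it is not thread-safe, it is not the Pass-The-Buck algorithm, and the bounds MAX_GUARDS and MAX_ESCAPING, the array-plus-count representation of a set, and the -1 return from HireGuard when no guard is available are all our own illustrative choices. The model is conservative: it declines to liberate any pointer on which a guard is currently posted, which is sufficient for (though stronger than) the condition that no guard has been posted continuously since the pointer began escaping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_GUARDS   8    /* assumption: fixed pool of guards */
#define MAX_ESCAPING 64   /* assumption: bound on escaping pointers */

typedef int   guard;
typedef void *ptr_t;

static bool   hired[MAX_GUARDS];
static ptr_t  post[MAX_GUARDS];        /* post[g] == NULL: g guards nothing */
static ptr_t  escaping[MAX_ESCAPING];  /* began escaping, not yet liberated */
static size_t n_escaping;

guard HireGuard(void) {
    for (guard g = 0; g < MAX_GUARDS; g++)
        if (!hired[g]) { hired[g] = true; post[g] = NULL; return g; }
    return -1;  /* toy model only: no guard available */
}

void FireGuard(guard g) {
    assert(g >= 0 && g < MAX_GUARDS && hired[g]);
    post[g]  = NULL;
    hired[g] = false;
}

void PostGuard(guard g, ptr_t p) {
    assert(g >= 0 && g < MAX_GUARDS && hired[g]);
    post[g] = p;  /* posting NULL removes the guard from its previous pointer */
}

static bool is_guarded(ptr_t p) {
    for (guard g = 0; g < MAX_GUARDS; g++)
        if (hired[g] && post[g] == p) return true;
    return false;
}

/* The n pointers in s begin escaping. Liberated pointers are written to out
   (which must have room for MAX_ESCAPING entries); returns how many. */
size_t Liberate(const ptr_t *s, size_t n, ptr_t *out) {
    for (size_t i = 0; i < n; i++) {
        assert(n_escaping < MAX_ESCAPING);
        escaping[n_escaping++] = s[i];
    }
    size_t n_out = 0, kept = 0;
    for (size_t i = 0; i < n_escaping; i++) {
        if (is_guarded(escaping[i]))
            escaping[kept++] = escaping[i];  /* still guarded: remains escaping */
        else
            out[n_out++] = escaping[i];      /* liberated: caller may free it */
    }
    n_escaping = kept;
    return n_out;
}
```

In this model, a pointer that is guarded when it is passed to Liberate is held back and is returned by a later Liberate invocation once the guard has been reposted or removed, which illustrates why the set returned may differ from the set passed in.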

[0065] Comments on API design

[0066] We could have rolled the functionality of hiring and firing guards into the PostGuard operation. Instead, we kept this functionality separate to allow implementations to make PostGuard, the most common operation, as efficient as possible. This separation allows the implementation more flexibility in managing resources associated with guards because the cost of hiring and firing guards can be amortized over many PostGuard operations.

[0067] In some applications, it may be desirable to be able to quickly “mark” a value for liberation, without doing any of the work of liberating the value. Consider, for example, an interactive system in which user threads should not execute relatively high-overhead “administrative” work such as liberating values, but additional processor(s) may be available to perform such work. In such a case, it may be desirable to provide two (or more) versions of Liberate, where a quick version simply hands off all the values it is passed and returns the empty set.

[0068] Finally, our terminology is somewhat arbitrary here. In general, the relevant concepts may be expressed differently (or more generally) in other implementations. For example, rather than posting a guard on a value, a given implementation may be described in terms of an Announce operation that announces an intention to use the value. Similarly, functionality corresponding to that of guards may be provided using “handles” or other facilities for making such announcements. Liberating constructs may be implemented or represented in terms of a “cleaner” operation and related states. In short, many variations may be made without departing from central aspects of our inventive techniques.

[0069] The Repeat Offender Problem

[0070] We now more formally define a Repeat Offender Problem (ROP), which captures some important properties of the support mechanisms for non-blocking memory management described herein. ROP is defined with respect to a set of values, a set of application clients, and a set of guards. Each value may be free, injail, or escaping; initially, all values are free. An application-dependent external Arrest action can cause a free value to become injail at any time. A client can help injail values to begin escaping, which causes them to become escaping. Values that are escaping can finish escaping and become free again.

[0071] Clients can use values, but must never use a value that is free. A client can attempt to prevent a value v from escaping while it is being used by “posting a guard” on v. However, if the guard is posted too late, it may fail to prevent v from escaping. Thus, to safely use v, a client must ensure that v is injail at some time after it posted a guard on v. Clients can hire and fire guards dynamically, according to their need.

[0072] ROP solutions can be used by threads (clients) to avoid dereferencing (using) a pointer (value) to an object that has been freed. In this context, an injail pointer is one that has been allocated (arrested) since it was last freed, and can therefore be used.

[0073] ROP solutions provide the following procedures: A client hires a guard by invoking HireGuard( ), and it fires a guard g that it employs by invoking FireGuard(g). A ROP solution ensures that a guard is never simultaneously employed by multiple clients. A client posts a guard g on a value v by invoking PostGuard(g,v); this removes the guard from any value it previously guarded. In the implementation illustrated, a special null value is used to remove the guard from the previously guarded value without posting the guard on a new value. A client may not post (or remove) a guard that it does not currently employ. A client helps a set of values S to begin escaping by invoking Liberate(S); the application must ensure that each value in S is injail before this call, and the call causes each value to become escaping. The Liberate procedure returns a (possibly different) set of escaping values, causing them to be liberated; each of these values becomes free on the return of this procedure. These transitions are summarized in FIG. 2. A ROP solution does not implement the functionality of the Arrest action (this is application-specific), but the ROP specification models arrests in order to know when a free value becomes injail.

[0074] If a guard g is posted on a value v, and v is injail at some time t after g is posted on v and before g is subsequently removed or reposted on a different value, then we say that g traps v from time t until g is removed or reposted. Of course, subsequent removal or reposting is not a requirement for trapping. Accordingly, if g is never removed or reposted, g traps v at time t and all later times. The operational specification of the main correctness condition for ROP is that it does not allow a value to escape (i.e., become free) while it is trapped.

[0075] A precise formulation of ROP is given by the I/O automaton shown in FIG. 3, explained below. Of course, any of a variety of implementations in accordance with the I/O automaton are suitable. We begin by adopting some notational conventions.

[0076] Notational Conventions: Unless otherwise specified, p and q denote clients (threads) from P, the set of all clients (threads); g denotes a guard from G, the set of all guards; v denotes a value from V, the set of all values, and S and T denote sets of values (i.e., subsets of V). We assume that V contains a special null value that is never used, arrested, or passed to Liberate.

[0077] The automaton consists of a set of environment actions and a set of ROP output actions. Each action consists of a precondition for performing the action and the effect on state variables of performing the action. Most environment actions are invocations of ROP operations, and are paired with corresponding ROP output actions that represent the system's response to the invocations. In particular, the PostInv_(p)(g,v) action models client p invoking PostGuard(g,v), and the PostResp_(p)( ) action models the completion of this procedure. The HireInv_(p)( ) action models client p invoking HireGuard( ), and the corresponding HireResp_(p)(g) action models the system assigning guard g to p. The FireInv_(p)(g) action models client p calling FireGuard(g), and the FireResp_(p)( ) action models the completion of this procedure. The LiberateInv_(p)(S) action models client p calling Liberate(S) to help the values in S start escaping, and the LiberateResp_(p)(T) action models the completion of this procedure with a set of values T that have finished escaping. Finally, the Arrest(v) action models the environment arresting value v.

[0078] The state variable status[v] records the current status of value v, which can be free, injail, or escaping. Transitions between the status values are caused by calls to and returns from ROP procedures, as well as by the application-specific Arrest action, as described above. The post variable maps each guard to the value (if any) it currently guards. The pc_(p) variable models control flow of client p, for example ensuring that p does not invoke a procedure before the previous invocation completes; pc_(p) also encodes parameters passed to the corresponding procedures in some cases. The guards_(p) variable represents the set of guards currently employed by client p. The numescaping variable is an auxiliary variable used to specify non-triviality properties, as discussed later. Finally, trapping maps each guard g to a boolean value that is true iff g has been posted on some value v, and has not subsequently been reposted (or removed), and at some point since the guard was posted on v, v has been injail (i.e., it captures the notion of guard g trapping the value on which it has been posted). This is used by the LiberateResp action to determine whether v can be returned. Recall that a value should not be returned if it is trapped.

[0079] Preconditions on the invocation actions specify assumptions about the circumstances under which the application invokes the corresponding ROP procedures. Most of these preconditions are mundane well-formedness conditions, such as the requirement that a client posts only guards that it currently employs. The precondition for LiberateInv captures the assumption that the application passes only injail values to Liberate, and the precondition for the Arrest action captures the assumption that only free values are arrested. In general, a determination of how these guarantees are made is a matter of design choice.

[0080] Preconditions on the response actions specify the circumstances under which the ROP procedures can return. Again, most of these preconditions are quite mundane and straightforward. The interesting case is the precondition of LiberateResp, which states that Liberate can return a value only if it has been passed to (some invocation of) Liberate, it has not subsequently been returned by (any invocation of) Liberate, and no guard g has been continually guarding the value since the last time it was injail. Recall that this is captured by trapping[g].
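The status and trapping bookkeeping described above can be modeled sequentially in a few lines of C. The sketch below reflects our reading of the prose only; it does not reproduce FIG. 3. It omits the pc and guards variables and the hiring and firing of guards, treats each Liberate invocation and response as acting on a single value, and uses small integers for values and guards. The bounds NV and NG, the use of 0 as the special null value, and the helper can_liberate are assumptions introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bounds; value 0 plays the role of the special null value. */
enum { NV = 8, NG = 4, NULL_VALUE = 0 };
typedef enum { FREE, INJAIL, ESCAPING } status_t;

static status_t status[NV];    /* zero-initialized: every value starts FREE */
static int      post[NG];      /* post[g]: value guarded by g, or NULL_VALUE */
static bool     trapping[NG];  /* true iff g currently traps post[g] */
static int      numescaping;   /* auxiliary count of escaping values */

/* Environment action: only a free value may be arrested. */
void Arrest(int v) {
    assert(v != NULL_VALUE && status[v] == FREE);
    status[v] = INJAIL;
    for (int g = 0; g < NG; g++)  /* guards already posted on v now trap it */
        if (post[g] == v) trapping[g] = true;
}

/* Posting g on v removes g from whatever it guarded before. The guard traps
   v immediately only if v is injail at the time of posting. */
void PostGuard(int g, int v) {
    post[g]     = v;
    trapping[g] = (v != NULL_VALUE && status[v] == INJAIL);
}

/* Invocation side of Liberate for one value: it must be injail. */
void LiberateInv(int v) {
    assert(status[v] == INJAIL);
    status[v] = ESCAPING;
    numescaping++;
}

/* Precondition of LiberateResp: v is escaping and no guard traps it. */
bool can_liberate(int v) {
    if (status[v] != ESCAPING) return false;
    for (int g = 0; g < NG; g++)
        if (post[g] == v && trapping[g]) return false;
    return true;
}

/* Response side for one value: returns true if v was liberated (now free). */
bool LiberateResp(int v) {
    if (!can_liberate(v)) return false;
    status[v] = FREE;
    numescaping--;
    return true;
}
```

Note the asymmetry the model makes visible: a guard posted while the value is injail blocks liberation until it is removed or reposted, whereas a guard posted only after the value has begun escaping is “too late” and does not.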

[0081] Desirable Properties

[0082] As specified so far, an ROP solution in which Liberate always returns the empty set, or simply does not terminate, is correct. Clearly, in the context motivating our work, such solutions are unacceptable because each escaping value represents a resource that will be reclaimed only when the value is liberated (returned by some invocation of Liberate). One might be tempted to specify that every value passed to a Liberate operation is eventually returned by some Liberate operation. However, without special operating system support, it is generally not possible to guarantee such a strong property in the face of failing threads. We do not specify here a particular non-triviality condition, as we do not want to unduly limit the range of solutions. Instead, we discuss some properties that might be useful in specifying non-triviality properties for proposed solutions.

[0083] The state variable numescaping counts the number of values that are currently escaping (i.e., that have been passed to some invocation of Liberate and have not subsequently been returned from any invocation of Liberate). If we require a solution to ensure that numescaping is bounded by some function of application-specific quantities, we exclude the trivial solution in which Liberate always returns the empty set. However, because this bound necessarily depends on the number of concurrent Liberate operations, and the number of values each Liberate operation is invoked with, it does not exclude the solution in which Liberate never returns.

[0084] A combination of a boundedness requirement and some form of progress requirement on Liberate operations seems to be the most appropriate way to specify the non-triviality requirement. Recall that we have defined a general value recycling problem in terms of Repeat Offender style terminology (e.g., posting guards, liberation and states such as injail and escaping) and that we expect a variety of implementations and algorithms to solve that general problem. One such implementation includes the Pass The Buck (PTB) algorithm (detailed below), which for simplicity of description is also presented in Repeat Offender style terminology. Turning to the PTB algorithm, we can establish that PTB provides a bound on numescaping that depends on the number of concurrent Liberate operations. Because the bound (necessarily) depends on the number of concurrent Liberate operations, if an unbounded number of threads fail while executing Liberate, then an unbounded number of values can be escaping. We emphasize, however, that our implementation does not allow failed threads to prevent values from being freed in the future. This property is an important advantage over Treiber's approach (referenced above).

[0085] Our Pass The Buck algorithm has two other desirable properties. First, the Liberate operation is wait-free (that is, it completes after a bounded number of steps, regardless of the timing behavior of other threads). Thus, we can calculate an upper bound on the amount of time Liberate will take to execute, which is useful in determining how to schedule Liberate work. Second, our algorithm has a property we call value progress. Roughly, this property guarantees that a value does not remain escaping forever provided Liberate is invoked “enough” times (unless a thread fails while the value is escaping).

[0086] Dynamic-Sized Lock-Free Queues

[0087] In this section, we present two dynamic-sized lock-free queue implementations based on a widely used lock-free queue algorithm previously described by Michael and Scott. See M. Michael & M. Scott, Nonblocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors, Journal of Parallel and Distributed Computing, 51(1):1-26, 1998. Note that Michael and Scott's algorithm and data structures implemented generally in accordance therewith provide us with useful context to describe our additional inventive concepts. Nothing herein should be taken as a suggestion that our techniques are derived from, linked with, or limited to the algorithm, data structures or any design choices embodied in or by Michael and Scott's work.

[0088] In Michael and Scott's algorithm (hereafter M&S), a queue is represented by a linked list, and nodes that have been dequeued are placed in a “freelist” implemented in the style of Treiber. In the description that follows, we refer to such freelists as “memory pools” in order to avoid confusion between “freeing” a node (by which we mean returning it to the memory allocator through the free library routine) and placing a node on a freelist. In this approach, rather than freeing nodes to the memory allocator when they are no longer required, we place them in a memory pool from which new nodes can be allocated later. An important disadvantage of this approach is that data structures implemented this way are not truly dynamic-sized: after they have grown large and subsequently shrunk, the memory pool contains many nodes that cannot be reused for other purposes, cannot be coalesced, etc.

[0089] Our two queue implementations achieve dynamic-sizing in different ways. Algorithm 1 eliminates the memory pool, invoking the standard malloc and free library routines to allocate and deallocate nodes of the queue. Algorithm 2 does use a memory pool, but unlike M&S, the nodes in the memory pool can be freed to the system. We present our algorithms in the context of a transformed version of the M&S algorithm (see below). This “generic code” invokes additional procedures that must be instantiated to achieve full implementations. We first provide exemplary instantiations consistent with the original M&S algorithm. Then, we provide instantiations for our new algorithms. In this way, we illustrate true dynamic-sizing in the context of a familiar lock-free data structure design that does not itself provide a true dynamic sizing capability.

[0090] Note that although the M&S design does allow nodes to be added to and removed from the queue, such nodes are added from, and removed to, a memory pool. Since no mechanism is provided to remove nodes from the memory pool, the amount of storage allocated for use by the queue is monotonically non-decreasing. Accordingly, it is not really correct to describe the M&S design as a dynamic-sized lock-free data structure. Our work achieves a true dynamic-sized lock-free queue.

[0091] Michael and Scott's Algorithm

[0092] In general, the M&S design will be understood in the context of the following slightly transformed version which plays the role of a “generic code” base for the modified versions that follow. The M&S design builds on a queue data structure that will be understood as follows:

    struct pointer_t { node_t *ptr; int version; }
    struct node_t { int value; pointer_t next; }
    struct queue_t { pointer_t Head, Tail; }

    queue_t *newQueue( ) {
        queue_t *Q = malloc(sizeof(queue_t));
        node_t *node = allocNode( );
        node->next.ptr = null;
        Q->Head.ptr = Q->Tail.ptr = node;
        return Q;
    }

    bool Enqueue(queue_t *Q, int value) {
    1     node_t *node = allocNode( );
    2     if (node == null)
    3       return FALSE;
    4     node->value = value;
    5     node->next.ptr = null;
    6     while (TRUE) {
    7       pointer_t tail;
    8       GuardedLoad(&Q->Tail, &tail, 0);
    9       pointer_t next = tail.ptr->next;
    10      if (tail == Q->Tail) {
    11        if (next.ptr == null) {
    12          if (CAS(&tail.ptr->next, next, <node,next.version+1>))
    13            break;
    14        } else
    15          CAS(&Q->Tail, tail, <next.ptr,tail.version+1>);
            }
          }
    16    CAS(&Q->Tail, tail, <node,tail.version+1>);
    17    Unguard(0);
    18    return TRUE;
    }

    bool Dequeue(queue_t *Q, int *pvalue) {
    19    while (TRUE) {
    20      pointer_t head;
    21      GuardedLoad(&Q->Head, &head, 0);
    22      pointer_t tail = Q->Tail;
    23      pointer_t next;
    24      GuardedLoad(&head.ptr->next, &next, 1);
    25      if (head == Q->Head) {
    26        if (head.ptr == tail.ptr) {
    27          if (next.ptr == null) {
    28            Unguard(0);
    29            Unguard(1);
    30            return FALSE;
                }
    31          CAS(&Q->Tail, tail, <next.ptr,tail.version+1>);
    32        } else {
    33          *pvalue = next.ptr->value;
    34          if (CAS(&Q->Head, head, <next.ptr,head.version+1>))
    35            break;
              }
            }
          }
    36    Unguard(0);
    37    Unguard(1);
    38    deallocNode(head.ptr);
    39    return TRUE;
    }

[0093] The preceding generic code invokes four additional procedures (allocNode, deallocNode, GuardedLoad and Unguard), which are not specified in the generic code. Three variations on the M&S design can be achieved using three different implementations of the additional procedure sets. For completeness, a first set results in an implementation that corresponds to the original M&S design. In short, we get the original M&S algorithm by instantiating the following procedures:

    node_t *allocNode( ) {
    1     if (memory pool is empty)
    2       return malloc(sizeof(node_t));
    3     else {
    4       return node removed from memory pool;
          }
    }

    void deallocNode(node_t *n) {
    5     add n to memory pool
    }

    void GuardedLoad(pointer_t *s, pointer_t *t, int h) {
    6     *t = *s;
    7     return;
    }

    void Unguard(int h) {
    8     return;
    }

[0094] The allocNode and deallocNode procedures use a memory pool. The allocNode procedure removes and returns a node from the memory pool if possible and calls malloc if the pool is empty. The deallocNode procedure puts the node being deallocated into the memory pool. As stated above, nodes in the memory pool cannot be freed to the system. Michael and Scott do not specify how nodes are added to and removed from the memory pool. Because the M&S design does not use a value recycling technique such as that provided by solutions to the ROP, it has no notion of “guards.” As a result, GuardedLoad is an ordinary load and Unguard is a no-op.

[0095] We do not discuss M&S in detail. See M. Michael & M. Scott, Nonblocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors, Journal of Parallel and Distributed Computing, 51(1):1-26, 1998 for such details. Instead, we discuss below the aspects that are relevant for our purposes.

[0096] Although nodes in the M&S memory pool have been deallocated, they cannot be freed to the system because some thread may still intend to perform a CAS on the node. Various problems can arise from accesses to memory that has been freed. Thus, although it is not discussed at length by the authors, M&S's use of the memory pool is necessary for correctness. Because Enqueue may reuse nodes from the memory pool, M&S uses version numbers to avoid the ABA problem, in which a CAS succeeds even though the pointer it accesses has changed because the node pointed to was deallocated and then subsequently allocated. The version numbers are stored with each pointer and are atomically incremented each time the pointer is modified. This causes such “late” CAS's to fail, but it does not prevent them from being attempted.
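The effect of version numbers on a “late” CAS can be seen in the following C11 sketch. It is an illustration under assumptions, not code from M&S: to keep the <ptr, version> pair within a single 64-bit word that a standard atomic CAS can cover, it uses a hypothetical 32-bit node index in place of a real pointer, and the names vptr_t, make_vptr, cas_versioned and cas_plain are invented for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* A <ptr, version> pair packed into one 64-bit word so that a single CAS
   covers both fields. The 32-bit node index stands in for a pointer. */
typedef uint64_t vptr_t;

static inline vptr_t make_vptr(uint32_t idx, uint32_t version) {
    return ((uint64_t)version << 32) | idx;
}
static inline uint32_t vptr_idx(vptr_t p)     { return (uint32_t)p; }
static inline uint32_t vptr_version(vptr_t p) { return (uint32_t)(p >> 32); }

/* Versioned CAS: succeeds only if both index and version still match the
   expected pair, and installs new_idx with the version incremented. */
static bool cas_versioned(_Atomic vptr_t *loc, vptr_t expected, uint32_t new_idx) {
    vptr_t desired = make_vptr(new_idx, vptr_version(expected) + 1);
    return atomic_compare_exchange_strong(loc, &expected, desired);
}

/* Unversioned CAS, for contrast: compares the index alone. */
static bool cas_plain(_Atomic uint32_t *loc, uint32_t expected, uint32_t new_idx) {
    return atomic_compare_exchange_strong(loc, &expected, new_idx);
}
```

If a location changes from index 1 to index 2 and back to index 1 (as happens when a node is deallocated and reallocated), a thread holding the pair it read before those changes fails cas_versioned because the version has advanced by two, whereas cas_plain with the stale index succeeds. As the text notes, the versioned CAS is still attempted; it merely fails.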

[0097] The queue is represented by two node pointers: the Head, from which nodes are dequeued, and the Tail, where nodes are enqueued. The Head and Tail pointers are never null; the use of a “dummy” node ensures that the list always contains at least one node. When a node is deallocated, no path exists from either the Head or the Tail to that node. Furthermore, such a path cannot subsequently be established before the node is allocated again in an Enqueue operation. Therefore, if such a path exists, then the node is in the queue. Also, once a node is in the queue and its next field has become non-null, its next field cannot become null again until the memory that contains the node is subsequently reallocated, implying that the node has been freed before that time. These properties provide a basis to establish correctness of our dynamic-sized variants of the generic M&S design.

[0098] Algorithm 1—Direct Dynamic Sizing (No Memory Pool)

[0099] As mentioned earlier, Algorithm 1 eliminates the memory pool, and uses malloc and free directly for memory allocation. As discussed below, Algorithm 1 also eliminates the ABA problem, and thus, the need for version numbers. Significantly and unlike the original M&S design, this feature allows designs based on Algorithm 1 to be used on systems that support CAS only on pointer-sized values.

[0100] For ease of understanding we present variations for Algorithm 1 in the context of the above “generic code.” Algorithm 1, with true dynamic sizing, is achieved by instantiating the following variations of the additional procedures for use by the generic code:

    node_t *allocNode( ) {
    1     return malloc(sizeof(node_t));
    }

    void deallocNode(node_t *n) {
    2     for each m ∈ Liberate({n})
    3       free(m);
    }

    void GuardedLoad(pointer_t *s, pointer_t *t, int g) {
    4     while (TRUE) {
    5       *t = *s;
    6       if (t->ptr == null)
    7         return;
    8       PostGuard(guards[p][g], t->ptr);
    9       if (*t == *s)
    10        return;
          }
    }

    void Unguard(int g) {
    11    PostGuard(guards[p][g], null);
    }

[0101] Persons of ordinary skill in the art will, of course, recognize that separation of Algorithm 1 into additional procedures and generic code is arbitrary and is employed purely for illustration. Corresponding, indeed even technically equivalent, implementations may be achieved without resorting to this artificial pedagogical separation.

[0102] As explained below, the preceding procedures employ value recycling techniques in accordance with a ROP formulation. We assume that before accessing the queue, each thread p has hired two guards and stored identifiers for these guards in guards[p][0] and guards[p][1]. The allocNode procedure simply invokes malloc. However, because some thread may have a pointer to a node being deallocated, deallocNode cannot simply invoke free. Instead, deallocNode passes the node being deallocated to Liberate and then frees any nodes returned by Liberate. The properties of ROP ensure that a node is never returned by an invocation of Liberate while some thread might still access that node.

[0103] The GuardedLoad procedure loads a value from the address specified by its first argument and stores the value loaded in the address specified by its second argument. The goal of this procedure is to ensure that the value loaded is guarded by the guard specified by the third argument before the value is loaded. This goal is accomplished by a lock-free loop that retries if the value loaded changes after the guard is posted (null values do not have to be guarded, as they will never be dereferenced). As explained below, GuardedLoad helps ensure that guards are posted soon enough to trap the pointers they guard, and therefore to prevent the pointers they guard from being freed prematurely. The Unguard procedure removes the specified guard.
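A version-free GuardedLoad and Unguard along these lines can be sketched with C11 atomics as follows. The sketch makes several assumptions that are not part of the description above: the guard slots are modeled as a plain two-dimensional array guard_post indexed by a hypothetical thread index p and guard number g (standing in for PostGuard on guards[p][g]), the shared location holds a bare node pointer rather than a pointer_t with a version field, and the bounds MAX_THREADS and GUARDS_PER_THREAD are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MAX_THREADS       4  /* assumption: fixed number of threads */
#define GUARDS_PER_THREAD 2  /* each thread has hired two guards */

typedef struct node_t {
    int value;
    _Atomic(struct node_t *) next;
} node_t;

/* guard_post[p][g] models the pointer on which thread p's guard g is posted. */
static _Atomic(node_t *) guard_post[MAX_THREADS][GUARDS_PER_THREAD];

static void PostGuard(int p, int g, node_t *ptr) {
    assert(p >= 0 && p < MAX_THREADS && g >= 0 && g < GUARDS_PER_THREAD);
    atomic_store(&guard_post[p][g], ptr);
}

/* Loads *s into *t, retrying until the loaded pointer is either null or was
   still present in *s after guard g had been posted on it. */
void GuardedLoad(int p, _Atomic(node_t *) *s, node_t **t, int g) {
    for (;;) {
        *t = atomic_load(s);
        if (*t == NULL)
            return;                 /* null is never dereferenced: no guard needed */
        PostGuard(p, g, *t);
        if (*t == atomic_load(s))
            return;                 /* the guard was posted in time */
    }
}

void Unguard(int p, int g) {
    PostGuard(p, g, NULL);
}
```

As in the listing above, a null load returns without reposting the guard, so whatever the guard previously protected remains guarded until the next non-null load or an explicit Unguard.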

[0104] Correctness argument for Algorithm 1

[0105] Operation of Algorithm 1 corresponds to that of M&S, except for issues involving memory allocation and deallocation. To see this, observe that GuardedLoad implements an ordinary load, and Unguard does not affect any variables of the underlying M&S algorithm. Therefore, we need only argue that no instruction accesses a freed node. Because nodes are freed only after being returned by Liberate, it suffices to argue for each access to a node, that, at the time of the access, a pointer to the node has been continuously guarded since some point at which the node was in the queue (that is, a node is accessed only if it is trapped). As discussed earlier, if there is a path from either Head or Tail to a node, then the node is in the queue. As shown below, we can exploit code already included in M&S, together with the specialization code in FIG. 6, to detect the existence of such paths.

[0106] We first consider the access at line 9 of Enqueue. In this case, the pointer to the node being accessed was acquired from the call to GuardedLoad at line 8. Because the pointer is loaded directly from Tail in this case, the load in line 9 of the Algorithm 1 implementation of GuardedLoad serves to observe a path (of length one) from Tail to the accessed node. The argument is similarly straightforward for the access at line 12 of Enqueue and the access in GuardedLoad when invoked from line 24 (Dequeue).

[0107] The argument for the access at line 33 of Dequeue is not as simple. First, observe that the load at line 9 of GuardedLoad (in the call at line 24 of Dequeue) determines that there is a pointer from the node specified by Head.ptr to the node accessed at line 33. Then, the test at line 25 determines that there is a pointer from Head to the node specified by Head.ptr. If these two pointers existed simultaneously at some point between the guard being posted as a result of the call at line 24 and the access at line 33, then the required path existed. As argued above, the node pointed to by Head.ptr is guarded and was in the queue at some point since the guard was posted in the call to GuardedLoad at line 21, and this guard is not removed or reposted before the execution of line 33. Therefore, relying on the properties of ROP, this node cannot be freed and reallocated in this interval. Also, in the M&S algorithm, a node that is dequeued does not become reachable from Head again before it has been reallocated by an Enqueue. Therefore, the load at line 25 confirmed that Head contained the same value continuously since the execution of line 21. This in turn implies that the two pointers existed simultaneously at the point at which the load in GuardedLoad invoked from line 24 was executed. The last part of this argument can be made much more easily by observing that the version number (discussed next) of Head did not change. However, we later observe that the version numbers can be eliminated from Algorithm 1, so we do not want to rely on them in our argument. This concludes our argument that Algorithm 1 never accesses freed memory.

[0108] Next, we show that the version numbers for the node pointers are unnecessary in our Algorithm 1. Apart from the overhead involved with managing these version numbers, the requirement that they are updated atomically with pointers renders algorithms that use them inapplicable in systems that support CAS only on pointer-sized values. Accordingly, the ability to eliminate version numbers is an important achievement in and of itself. In addition, since version numbers were “necessary” to avoid an ABA hazard, another useful exploitation of the invented techniques is now apparent, namely ABA hazard avoidance.

[0109] Eliminating version numbers in Algorithm 1

[0110] By inspecting the code for Algorithm 1, we can see that the onlyeffect of the version numbers is to make some comparisons fail thatwould otherwise have succeeded. These comparisons are always between ashared variable V and a value previously read from V. The comparisonswould fail anyway if V's pointer component had changed, and wouldsucceed in any case if V had not been modified since the V was read.Therefore, version numbers change the algorithm's behavior only in thecase that a thread p reads value A from V at time t, V subsequentlychanges to some other value B, and later still, at time t′, V changesback to a value that contains the same pointer component as A, and pcompares V to A. With version numbers, the comparison would fail, andwithout them it would succeed. We begin by establishing that versionnumbers do not affect the outcome of comparisons other than the one inline 9 of GuardedLoad. We deal with that case later.

[0111] We first consider cases in which A's pointer component is non-null. It can be shown for each shared pointer variable V in the algorithm that the node pointed to by A is freed and subsequently reallocated between times t and t′ in this case. Furthermore, it can be shown that each of the comparisons mentioned above occurs only if a guard was posted on A before time t and is still posted when the subsequent comparison is performed, and that the value read from A was in the queue at some point since the guard was posted when the comparison is performed. Because ROP prohibits nodes from being returned by Liberate (and therefore from being freed) in this case, this implies that these comparisons never occur in Algorithm 1.

[0112] We next consider the case in which A's pointer component is null. The only comparison of a shared variable to a value with a null pointer is the comparison performed at line 12 of the Enqueue operation (because the Head and Tail never contain null and therefore neither do the values read from them). As argued earlier, the access at line 12 is performed only when the node being accessed is trapped. Also, as discussed earlier, the next field of a node in the queue does not become null again until the node is initialized by the next Enqueue operation to allocate that node. However, ROP ensures that the node is not returned from Liberate, and is therefore not subsequently freed and reallocated, before the guard is removed or reposted.

[0113] It remains to consider the comparison in line 9 of GuardedLoad, which can have a different outcome if version numbers are used than it would if they were not used. However, this does not affect the externally-observable behavior of the GuardedLoad procedure, and therefore does not affect correctness. The only property of the GuardedLoad procedure on which we have depended for our correctness argument is the following: GuardedLoad stores a value v in the location pointed to by its second argument such that v was in the location pointed to by GuardedLoad's first argument at some point during the execution of GuardedLoad and that a guard was posted on (the pointer component of) v before that time and has not subsequently been reposted or removed. It is easy to see that this property is guaranteed by the GuardedLoad procedure, with or without version numbers.
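The property stated above can be seen in a minimal C11 sketch of a post-then-confirm load without version numbers. This is an illustration under stated assumptions rather than the GuardedLoad listing itself: the POST array, the bound MG, and the names post_guard, guarded_load, and guarded_load_demo are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MG 4                        /* assumed number of guards */
static _Atomic(void *) POST[MG];    /* hypothetical posting locations, one per guard */

static void post_guard(int g, void *v) { atomic_store(&POST[g], v); }

/* Load *src into *dst so that the loaded pointer is guarded by guard g. */
static void guarded_load(_Atomic(void *) *src, void **dst, int g) {
    for (;;) {
        void *v = atomic_load(src);
        *dst = v;
        if (v == NULL) return;               /* nothing to guard */
        post_guard(g, v);                    /* post the guard first */
        if (atomic_load(src) == v) return;   /* v was in *src after the guard was posted */
        /* otherwise *src changed in between: repost and retry */
    }
}

static int guarded_load_demo(void) {
    static int node;
    _Atomic(void *) shared;
    atomic_init(&shared, (void *)&node);
    void *seen = NULL;
    guarded_load(&shared, &seen, 0);
    assert(seen == (void *)&node);
    assert(atomic_load(&POST[0]) == (void *)&node);
    return seen == (void *)&node;
}
```

The second load is a pointer-only comparison. Whether it succeeds or forces another iteration, the value finally stored through dst was present in the source location at a time when the guard was already posted on it.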

[0114] Algorithm 2—with Memory Pool

[0115] One drawback of Algorithm 1 is that every Enqueue and Dequeue operation involves a call to the malloc or free library routine (or other similar facility), introducing significant overhead. In addition, every Dequeue operation invokes Liberate, which is also likely to be expensive. Algorithm 2 overcomes these disadvantages by reintroducing the memory pool. However, unlike the M&S algorithm, nodes in the memory pool of Algorithm 2 can be freed to the system.

[0116] Algorithm 2 is achieved by instantiating the generic code (described above) with the same GuardedLoad and Unguard procedures used for Algorithm 1, though with modified allocNode and deallocNode procedures such as illustrated below:

    pointer_t Pool;

    node_t *allocNode( ) {
     1   pointer_t oldPool, newPool;
     2   while (TRUE) {
     3     GuardedLoad(&Pool, &oldPool, 0);
     4     if (oldPool.ptr == null) {
     5       Unguard(0);
     6       return malloc(sizeof(node_t));
           }
     7     newPool = oldPool.ptr->next;
     8     Unguard(0);
     9     newPool.version = oldPool.version + 1;
    10     if (CAS(&Pool, oldPool, newPool)) {
    11       return oldPool.ptr;
           }
         }
    }

    void deallocNode(node_t *n) {
    12   pointer_t oldPool, newPool;
    13   while (TRUE) {
    14     oldPool = Pool;
    15     n->next.ptr = oldPool.ptr;
    16     newPool.ptr = n;
    17     newPool.version = oldPool.version + 1;
    18     if (CAS(&Pool, oldPool, newPool))
    19       return;
         }
    }

[0117] As in the original M&S algorithm, the allocNode and deallocNode procedures, respectively, remove nodes from and add nodes to the memory pool. Unlike the original algorithm, however, the memory pool is implemented so that nodes can be freed. Thus, by augmenting Algorithm 2 with a policy that decides between freeing nodes and keeping them in the memory pool for subsequent use, a truly dynamic-sized implementation can be achieved.

[0118] The above procedures use a linked-list representation of a stack for a memory pool. This implementation extends Treiber's straightforward implementation by guarding nodes in the Pool before accessing them; this allows us to pass removed nodes to Liberate and to free them when returned from Liberate without the risk of a thread accessing a node after it has been freed. Our memory pool implementation is described in more detail below.

[0119] The node at the top of the stack is pointed to by a global variable Pool. We use the next field of each node to point to the next node in the stack. The deallocNode procedure uses a lock-free loop; each iteration uses CAS to attempt to add the node being deallocated onto the top of the stack. As in Treiber's implementation, a version number is incremented atomically with each modification of the Pool variable to avoid the ABA problem.
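The versioned update of Pool can be illustrated with a minimal C11 sketch. It is an assumption-laden simplification, not the allocNode and deallocNode procedures themselves: nodes are identified by index into a static array (so they are never freed and no guard is needed here), the index and version are packed into one 64-bit word, and the names next_of, pk, pool_init, pool_push, pool_pop, and pool_demo are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NNODES 4
#define NIL 0xFFFFFFFFu                    /* plays the role of a null pointer */

static _Atomic uint32_t next_of[NNODES];   /* next field of each node, by index */
static _Atomic uint64_t Pool;              /* packed <top index, version> */

static uint64_t pk(uint32_t idx, uint32_t ver) { return ((uint64_t)ver << 32) | idx; }

static void pool_init(void) { atomic_store(&Pool, pk(NIL, 0)); }

/* Push node n onto the pool; every successful CAS bumps the version. */
static void pool_push(uint32_t n) {
    uint64_t old = atomic_load(&Pool);
    for (;;) {
        atomic_store(&next_of[n], (uint32_t)old);
        uint64_t upd = pk(n, (uint32_t)(old >> 32) + 1);
        if (atomic_compare_exchange_weak(&Pool, &old, upd)) return;
        /* on failure, old now holds the current Pool value; retry */
    }
}

/* Pop the top node, or return NIL when the pool is empty. */
static uint32_t pool_pop(void) {
    uint64_t old = atomic_load(&Pool);
    for (;;) {
        uint32_t top = (uint32_t)old;
        if (top == NIL) return NIL;
        uint64_t upd = pk(atomic_load(&next_of[top]), (uint32_t)(old >> 32) + 1);
        if (atomic_compare_exchange_weak(&Pool, &old, upd)) return top;
    }
}

static uint32_t pool_demo(void) {
    pool_init();
    pool_push(0);
    pool_push(1);
    uint32_t x = pool_pop(), y = pool_pop(), z = pool_pop();
    assert(x == 1 && y == 0 && z == NIL);           /* LIFO order, then empty */
    return (uint32_t)(atomic_load(&Pool) >> 32);    /* one increment per successful CAS */
}
```

In this sketch pool_demo() returns 4, reflecting two pushes and two successful pops. Where nodes can actually be freed, as in Algorithm 2, the read of the next field in the pop path is the access that the guard must protect.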

[0120] The allocNode procedure is more complicated. In order to remove a node from the top of the stack, allocNode determines the node that will become the new top of the stack. This is achieved by reading the next field of the node that is currently at the top of the stack. As before, we use a ROP-style value recycling solution to protect against the possibility of accessing (at line 7) a node that has been freed. Therefore, the node at the top of the stack is guarded and then confirmed by the GuardedLoad call at line 3. As in the easy cases discussed above for Algorithm 1, the confirmation of the pointer loaded by the call to GuardedLoad establishes that the pointer is trapped, because a node will not be passed to Liberate while it is still at the head of the stack.

[0121] We have not specified when or how nodes are passed to Liberate. There are many possibilities and the appropriate choice depends on the application and system under consideration. Any of a variety of design choices are suitable. One possibility is for the deallocNode procedure to liberate nodes when the size of the memory pool exceeds some fixed limit. Alternatively, we could have an independent “helper” thread that periodically (or routinely) checks the memory pool and decides whether to liberate some nodes in order to reduce the size of the memory pool. Such decisions could be based on the size of the memory pool or on other criteria. In general, there is no need for the helper thread to grow the memory pool because this will occur naturally. When there are no nodes in the memory pool, allocNode invokes malloc to allocate space for a new node.

[0122] The Pass The Buck Algorithm

[0123] In this section, we describe one value recycling solution, the Pass The Buck (PTB) algorithm. As before, the PTB algorithm is presented using ROP terminology. An important goal when designing PTB was to minimize the performance penalty to the application when no values are being liberated. That is, the PostGuard operation should be implemented as efficiently as possible, perhaps at the cost of a more expensive Liberate operation. Such solutions are desirable for at least two reasons. First, PostGuard is invoked by the application, so its performance impacts application performance. On the other hand, Liberate work can be done by a spare processor, or by a background thread, so that it does not directly impact application performance. Second, solutions that optimize PostGuard performance are desirable for scenarios in which values are liberated infrequently, but we must retain the ability to liberate them. An example is the implementation of a dynamic-sized data structure that uses a memory pool to avoid allocating and freeing objects under “normal” circumstances but can free elements of the memory pool when it grows too large. In this case, no liberating is necessary while the size of the data structure is relatively stable. With these goals in mind, we describe our Pass The Buck algorithm below.

    struct {value val; int ver} HO_t    // HO_t fits into CAS-able location
    constant MG: max. number of guards
    shared variable
      GUARDS: array[0..MG-1] of bool init false;
      MAXG: int init 0;
      POST: array[0..MG-1] of value init null;
      HNDOFF: array[0..MG-1] of HO_t init <null, 0>;

    int HireGuard( ) {
     1   int i = 0, max;
     2   while (!CAS(&GUARDS[i], false, true))
     3     i++;
     4   while ((max = MAXG) < i)
     5     CAS(&MAXG, max, i);
     6   return i;
    }

    void FireGuard(int i) {
     7   GUARDS[i] = false;
     8   return;
    }

    void PostGuard(int i, value v) {
     9   POST[i] = v;
    10   return;
    }

    value set Liberate(value set vs) {
    11   int i = 0;
    12   while (i <= MAXG) {
    13     int attempts = 0;
    14     HO_t h = HNDOFF[i];
    15     value v = POST[i];
    16     if (v != null && vs->search(v)) {
    17       while (true) {
    18         if (CAS(&HNDOFF[i], h, <v, h.ver+1>)) {
    19           vs->delete(v);
    20           if (h.val != null) vs->insert(h.val);
    21           break;
               }
    22         attempts++;
    23         if (attempts == 3) break;
    24         h = HNDOFF[i];
    25         if (attempts == 2 && h.val != null) break;
    26         if (v != POST[i]) break;
             }
    27     } else {
    28       if (h.val != null && h.val != v)
    29         if (CAS(&HNDOFF[i], h, <null, h.ver+1>))
    30           vs->insert(h.val);
           }
    31     i++;
         }
    32   return vs;
    }

[0124] Throughout the algorithm, the pointers to blocks of memory being managed are called values. The GUARDS array is used to allocate guards to threads. Here we assume a bound MG on the number of guards simultaneously employed. However, as later explained, we can remove this restriction. The POST array includes one location per guard, which holds the pointer value the guard is currently assigned to guard if one exists, and NULL otherwise. The HNDOFF array is used by Liberate to “hand off” responsibility for a value to another Liberate operation if the value has been trapped by a guard.

[0125] The HireGuard and FireGuard procedures essentially implement long-lived renaming. Specifically, for each guard g, we maintain an entry GUARDS[g], which is initially false. Thread p hires guard g by atomically changing GUARDS[g] from false (unemployed) to true (employed); p attempts this with each guard in turn until it succeeds (lines 2 and 3). The FireGuard procedure simply sets the guard back to false (line 7). The HireGuard procedure also maintains the shared variable MAXG, which is used by the Liberate procedure to determine how many guards to consider. Liberate considers every guard for which a HireGuard operation has completed. Therefore, it suffices to have each HireGuard operation ensure that MAXG is at least the index of the guard returned. This is achieved with the simple loop at lines 4 and 5.
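The hiring and firing steps described above can be sketched in C11 atomics as follows. This is a minimal sketch under stated assumptions: the bound MG is fixed at 8, the functions are named hire_guard and fire_guard, and the bounds check that returns -1 is an addition not present in lines 1 through 8 of the listing, which assume that a free guard exists.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MG 8                      /* assumed bound on the number of guards */

static atomic_bool GUARDS[MG];    /* zero-initialized: every guard starts unemployed */
static atomic_int MAXG;           /* highest guard index ever hired */

/* Hire the first free guard and make sure MAXG covers its index. */
static int hire_guard(void) {
    int i = 0;
    for (;;) {
        if (i == MG) return -1;   /* added check: all guards are employed */
        bool unemployed = false;
        if (atomic_compare_exchange_strong(&GUARDS[i], &unemployed, true)) break;
        i++;
    }
    int max = atomic_load(&MAXG);
    while (max < i) {
        /* on failure, max is refreshed with the current MAXG */
        atomic_compare_exchange_weak(&MAXG, &max, i);
    }
    return i;
}

/* Fire guard i by marking it unemployed again. */
static void fire_guard(int i) { atomic_store(&GUARDS[i], false); }

static int hire_fire_demo(void) {
    int a = hire_guard();         /* first free guard is 0 */
    int b = hire_guard();         /* next free guard is 1 */
    fire_guard(a);
    int c = hire_guard();         /* guard 0 is free again and is rehired */
    assert(a == 0 && b == 1 && c == 0);
    return atomic_load(&MAXG);    /* MAXG never decreases */
}
```

Run once from a fresh process, hire_fire_demo() returns 1: MAXG records the highest index ever hired, so a Liberate operation still examines guard 1 even after guard 0 has been fired and rehired.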

[0126] PostGuard is implemented as a single store of the value to be guarded in the specified guard's POST entry (line 9), in accordance with our goal of making PostGuard as efficient as possible.

[0127] Some of the most interesting parts of the PTB algorithm lie in the Liberate procedure. Recall that Liberate should return a set of values that have been passed to Liberate and have not since been returned by Liberate, subject to the constraint that Liberate cannot return a value that has been continuously guarded by the same guard since before the value was most recently passed to Liberate (i.e., Liberate must not return trapped values).

[0128] Liberate is passed a set of values, and it adds values to and removes values from its value set as described below before returning (i.e., liberating) the remaining values in the set. Because we want the Liberate operation to be wait-free, if some guard g is guarding a value v in the value set of some thread p executing Liberate, then p must either determine that g is not trapping v or remove v from p's value set before returning that set. To avoid losing values, any value that p removes from its set must be stored somewhere so that, when the value is no longer trapped, another Liberate operation may pick it up and return it. The interesting details of PTB concern how threads determine that a value is not trapped, and how they store values while keeping space overhead for stored values low. Below, we explain the Liberate procedure in more detail, paying particular attention to these issues.

[0129] The loop at lines 12 through 31 iterates over all guards ever hired. For each guard g, if p cannot determine for some value v in its set that v is not trapped by g, then p attempts to “hand v off to g.” If p succeeds in doing so (line 18), it removes v from its set (line 19) and proceeds to the next guard (lines 21 and 31). If p repeatedly attempts and fails to hand v off to g, then, as we explain below, v cannot be trapped by g, so p can move on to the next guard. Also, as explained in more detail below, p might simultaneously pick up a value previously handed off to g by another Liberate operation, in which case this value can be shown not to be trapped by g, so p adds this value to its set (line 20). When p has examined all guards (see line 12), it can safely return any values remaining in its set (line 32).

[0130] We describe the processing of each guard in more detail below. First, however, we present a central property of a correctness proof of this algorithm, which will aid the presentation that follows; this lemma is quite easy to see from the code and the high-level description given thus far.

[0131] Single Location Lemma: For each value v that has been passed to some invocation of Liberate and not subsequently returned by any invocation of Liberate, either v is handed off to exactly one guard, or v is in the value set of exactly one Liberate operation (but not both). Also, any value handed off to a guard or in the value set of any Liberate operation has been passed to Liberate and not subsequently returned by Liberate.

[0132] The processing of each guard g proceeds as follows: At lines 15 and 16, p determines whether the value currently guarded by g (if any), call it v, is in its set. If so, p executes the loop at lines 17 through 26 in order to either determine that v is not trapped, or to remove v from its set. In order to avoid losing v in the latter case, p “hands v off to g” by storing v in HNDOFF[g]. In addition to the value, an entry in the HNDOFF array contains a version number, which, for reasons that will become clear later, is incremented with each modification of the entry. Because at most one value may be trapped by guard g at any time, a single location HNDOFF[g] for each guard g is sufficient. To see why, observe that if p needs to hand v off because it is guarded, then the value (if any), call it w, previously stored in HNDOFF[g] is no longer guarded, so p can pick w up and add it to its set. Because p attempts to hand off v only if v is in p's set, the Single Location Lemma implies that v≠w. The explanation above gives the basic idea of our algorithm, but it is simplified. There are various subtle race conditions that must be avoided. Below, we explain in more detail how the algorithm deals with these race conditions.

[0133] To hand v off to g, p uses a CAS operation to attempt to replace the value previously stored in HNDOFF[g] with v (line 18); this ensures that, upon success, p knows which value it replaced, so it can add that value to its set (line 20). We explain later why it is safe to do so. If the CAS fails due to a concurrent Liberate operation, then p rereads HNDOFF[g] (line 24) and loops around to retry the handoff. There are various conditions under which we break out of this loop and move on to the next guard. Note in particular that the loop completes after at most three CAS attempts; see lines 13, 22, and 23. Thus our algorithm is wait-free. We explain later why it is safe to stop trying to hand v off in each of these cases.

[0134] We first consider the case in which p exits the loop due to a successful CAS at line 18. In this case, as described earlier, p removes v from its set (line 19), adds the previous value in HNDOFF[g] to its set (line 20), and moves on to the next guard (lines 21 and 31). An important part of understanding our algorithm is to understand why it is safe to take the previous value of HNDOFF[g], call it w, to the next guard. The reason is that we read POST[g] (line 15 or 26) between reading HNDOFF[g] (line 14 or 24) and attempting the CAS at line 18. Because each modification to HNDOFF[g] increments its version number, it follows that w was in HNDOFF[g] when p read POST[g]. Also, recall that w≠v in this case. Therefore, when p read POST[g], w was not guarded by g. Furthermore, because w remained in HNDOFF[g] from that moment until the CAS, w cannot become trapped in this interval. This is because a value can become trapped only if it has not been passed to Liberate since it was last allocated, and all values in the HNDOFF array have been passed to some invocation of Liberate and not yet returned by any invocation of Liberate (and have therefore not been freed and reallocated since being passed to Liberate).

[0135] It remains to consider how p can break out of the loop without performing a successful CAS. In each case, p can infer that v is not trapped by g, so it can give up on its attempt to hand v off. If p breaks out of the loop at line 26, then v is not trapped by g at that moment simply because it is not even guarded by g. The other two cases (lines 23 and 25) occur only after a certain number of times around the loop, implying a certain number of failed CAS operations.

[0136] To see why we can infer that v is not trapped in each of these two cases, consider the timing diagram in FIG. 4. For the rest of this section, we use the notation v_(p) to indicate the value of thread p's local variable v in order to distinguish between the local variables of different threads. In FIG. 4, we construct an execution in which p fails its CAS three times. The bottom line represents thread p:

[0137] at (A), p reads HNDOFF [g] for the first time (line 14);

[0138] at (B), p's CAS fails;

[0139] at (C), p rereads HNDOFF [g] at line 24; and

[0140] so on for (D), (E), and (F).

[0141] Because p's CAS at (B) fails, some other thread q₀ executing Liberate performed a successful CAS after (A) and before (B). We choose one and call it (G). The arrows between (A) and (G) and between (G) and (B) indicate that we know (G) comes after (A) and before (B). Similarly, some thread q₁ executes a successful CAS on HNDOFF[g] after (C) and before (D), call it (H); and some thread q₂ executes a successful CAS on HNDOFF[g] after (E) and before (F), call it (I). Threads q₀ through q₂ might not be distinct, but there is no loss of generality in treating them as if they were.

[0142] Now, consider the CAS at (H). Because every successful CAS increments the version number field of HNDOFF[g], q₁'s previous read of HNDOFF[g] (at line 14 or line 24), call it (J), must come after (G). Similarly, q₂'s previous read of HNDOFF[g] before (I), call it (K), must come after (H).

[0143] We consider two cases. First, suppose (H) is an execution of line 18 by q₁. In this case, between (J) and (H), q₁ read POST[g]=v_(q1), either at line 15 or at line 26; call this read (L). By the Single Location Lemma, because v_(p) is in p's set and v_(q1) is in q₁'s set, v_(p)≠v_(q1), so the read at (L) implies that v_(p) was not guarded by g at (L). Therefore, v_(p) was not trapped by g at (L), which implies that it is safe for p to break out of the loop after (D) in this case (observe that attempts_(p)=2 in this case).

[0144] For the second case, suppose (H) is an execution of line 29 by thread q₁. In this case, because q₁ is storing null instead of a value in its own set, the above argument does not work. However, because p breaks out of the loop at line 25 only if it reads a non-null value from HNDOFF[g] at line 24, it follows that if p does so, then some successful CAS stored a non-null value to HNDOFF[g] at or after (H), and in this case the above argument can be applied to that CAS to show that v_(p) was not trapped. If p reads null at line 24 after (D), then it continues through its next loop iteration.

[0145] In this case, there is a successful CAS (I) that comes after (H). Because (H) stored null in the current case, no subsequent execution of line 29 by any thread will succeed before the next successful execution of the CAS in line 18 by some thread. To see why, observe that the CAS at line 29 never succeeds while HNDOFF[g] contains null (see line 28). Therefore, for (I) to exist, there is a successful execution of the CAS at line 18 by some thread after (H) and at or before (I). Using this CAS, we can apply the same argument as before to conclude that v_(p) was not trapped. It is easy to see that PTB is wait-free.

[0146] As described so far, p picks up a value from HNDOFF[g] only if its value set contains a value that is guarded by guard g. Therefore, without some additional mechanism, a value stored in HNDOFF[g] might never be picked up from there. To avoid this problem, even if p does not need to remove a value from its set, it still picks up the previously handed off value (if any) by replacing it with null (see lines 28 through 30). We know it is safe to pick up this value by the argument above that explains why it is safe to pick up the value stored in HNDOFF[g] in line 18. Thus, if a value v is handed off to guard g, then the first Liberate operation to begin processing guard g after v is not trapped by g will ensure that v is picked up and taken to the next guard (or returned from Liberate if g is the last guard), either by that Liberate operation or some concurrent Liberate operation.
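The handoff and pickup behavior described in the preceding paragraphs can be exercised with a minimal C11 transcription of the Liberate procedure. The following is a sketch under stated assumptions, not a definitive implementation: values are 32-bit identifiers with 0 standing for null, a HNDOFF entry is packed as <val, ver> in one 64-bit word, and the value set is a small fixed-capacity array. The names vset_t, vs_find, vs_insert, vs_delete, ho_pack, ho_val, ho_ver, post_guard, liberate, and liberate_demo are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MG 4           /* assumed bound on guards */
#define SET_CAP 16     /* assumed capacity of a value set */
#define NULL_VAL 0u    /* value 0 plays the role of null */

typedef struct { uint32_t items[SET_CAP]; int n; } vset_t;

static int vs_find(const vset_t *s, uint32_t v) {
    for (int k = 0; k < s->n; k++) if (s->items[k] == v) return k;
    return -1;
}
static void vs_insert(vset_t *s, uint32_t v) {
    if (vs_find(s, v) < 0 && s->n < SET_CAP) s->items[s->n++] = v;
}
static void vs_delete(vset_t *s, uint32_t v) {
    int k = vs_find(s, v);
    if (k >= 0) s->items[k] = s->items[--s->n];
}

static atomic_int MAXG;                /* highest guard index ever hired */
static _Atomic uint32_t POST[MG];      /* value guarded by each guard */
static _Atomic uint64_t HNDOFF[MG];    /* packed <val, ver> per guard */

static uint64_t ho_pack(uint32_t val, uint32_t ver) { return ((uint64_t)ver << 32) | val; }
static uint32_t ho_val(uint64_t h) { return (uint32_t)h; }
static uint32_t ho_ver(uint64_t h) { return (uint32_t)(h >> 32); }

static void post_guard(int i, uint32_t v) { atomic_store(&POST[i], v); }

/* Removes trapped values from vs (handing them off) and adds picked-up ones. */
static void liberate(vset_t *vs) {
    for (int i = 0; i <= atomic_load(&MAXG); i++) {
        int attempts = 0;
        uint64_t h = atomic_load(&HNDOFF[i]);
        uint32_t v = atomic_load(&POST[i]);
        if (v != NULL_VAL && vs_find(vs, v) >= 0) {
            for (;;) {
                uint64_t expected = h;
                if (atomic_compare_exchange_strong(&HNDOFF[i], &expected,
                                                   ho_pack(v, ho_ver(h) + 1))) {
                    vs_delete(vs, v);                   /* v is handed off to guard i */
                    if (ho_val(h) != NULL_VAL) vs_insert(vs, ho_val(h));
                    break;
                }
                attempts++;
                if (attempts == 3) break;
                h = atomic_load(&HNDOFF[i]);
                if (attempts == 2 && ho_val(h) != NULL_VAL) break;
                if (v != atomic_load(&POST[i])) break;
            }
        } else if (ho_val(h) != NULL_VAL && ho_val(h) != v) {
            uint64_t expected = h;
            if (atomic_compare_exchange_strong(&HNDOFF[i], &expected,
                                               ho_pack(NULL_VAL, ho_ver(h) + 1)))
                vs_insert(vs, ho_val(h));               /* pick up a stranded value */
        }
    }
}

static int liberate_demo(void) {
    post_guard(0, 5);                  /* guard 0 guards value 5 */
    vset_t vs = { {5, 7}, 2 };
    liberate(&vs);                     /* 5 is handed off; 7 remains */
    assert(vs.n == 1 && vs.items[0] == 7);
    post_guard(0, NULL_VAL);           /* guard 0 no longer guards 5 */
    vset_t empty = { {0}, 0 };
    liberate(&empty);                  /* the pickup path retrieves 5 */
    assert(empty.n == 1 && empty.items[0] == 5);
    return empty.n;
}
```

In the single-threaded liberate_demo, the first call hands value 5 off to guard 0 and returns only 7; once the guard is cleared, a later call with an empty set retrieves 5 through the pickup path, so liberate_demo() returns 1. The demo uses guard 0 directly, which the zero-initialized MAXG already covers.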

[0147] Although various shared variables employed in the above exemplary realizations (e.g., GUARDS[ ], POST[ ] and HNDOFF[ ]) are implemented as arrays of predetermined size, it is relatively straightforward to relax this restriction should it be desirable to do so in certain implementations or environments. For example, we could replace the GUARDS array by a linked list of elements, each containing at least one guard location. Association of posting and handoff locations with a given guard would be by any suitable data structure. Instead of stepping through the GUARDS array to hire a guard, threads would now traverse the linked list; if a thread reaches the end of the list without successfully hiring a guard, it can allocate a new node, and use CAS to attempt to atomically append the new node to the list. If this CAS fails, the thread resumes traversing the list from that point.
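One way the linked-list variant might look is sketched below in C11. It is a sketch only, under several assumptions: each list element holds exactly one guard together with its posting location, a newly appended node is created already hired by the appending thread, nodes are never removed from the list, and the names guard_node_t, first_guard, hire_guard_list, fire_guard_list, and guard_list_demo are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct guard_node {
    atomic_bool employed;                 /* plays the role of GUARDS[g] */
    _Atomic(void *) post;                 /* plays the role of POST[g] */
    _Atomic(struct guard_node *) next;    /* next element of the guard list */
} guard_node_t;

static guard_node_t first_guard;          /* the list always holds at least one guard */

/* Hire a guard by traversing the list, appending a new node if none is free. */
static guard_node_t *hire_guard_list(void) {
    guard_node_t *g = &first_guard;
    guard_node_t *fresh = NULL;
    for (;;) {
        bool unemployed = false;
        if (atomic_compare_exchange_strong(&g->employed, &unemployed, true)) {
            free(fresh);                  /* speculative node was never published */
            return g;
        }
        guard_node_t *nxt = atomic_load(&g->next);
        if (nxt == NULL) {                /* end of list: try to append */
            if (fresh == NULL) {
                fresh = malloc(sizeof *fresh);
                if (fresh == NULL) return NULL;
                atomic_init(&fresh->employed, true);   /* assumption: born hired */
                atomic_init(&fresh->post, NULL);
                atomic_init(&fresh->next, NULL);
            }
            guard_node_t *expected = NULL;
            if (atomic_compare_exchange_strong(&g->next, &expected, fresh))
                return fresh;
            nxt = expected;               /* CAS failed: resume from the appended node */
        }
        g = nxt;
    }
}

static void fire_guard_list(guard_node_t *g) { atomic_store(&g->employed, false); }

static int guard_list_demo(void) {
    guard_node_t *a = hire_guard_list();  /* hires first_guard */
    guard_node_t *b = hire_guard_list();  /* no free guard: appends a node */
    fire_guard_list(a);
    guard_node_t *c = hire_guard_list();  /* rehires first_guard */
    assert(a == &first_guard && b != NULL && b != a && c == a);
    return atomic_load(&first_guard.next) == b;
}
```

Because nodes are only ever appended in this sketch, a Liberate operation could traverse the same list to visit every guard ever hired, in place of the loop bounded by MAXG.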

[0148] Other Embodiments

[0149] While the invention is described with reference to various implementations and exploitations, it will be understood that these embodiments are illustrative and that the scope of the invention(s) is not limited to them. Terms such as always, never, all, none, etc. are used herein to describe sets of consistent states presented by a given computational system, particularly in the context of correctness proofs. Of course, persons of ordinary skill in the art will recognize that certain transitory states may and do exist in physical implementations even if not presented by the computational system. Accordingly, such terms and invariants will be understood in the context of consistent states presented by a given computational system rather than as a requirement for precisely simultaneous effect of multiple state changes. This “hiding” of internal states is commonly referred to by calling the composite operation “atomic”, and by allusion to a prohibition against any process seeing any of the internal states partially performed.

[0150] Many variations, modifications, additions, and improvements are possible. For example, while application to particular concurrent shared objects and particular implementations thereof have been described in detail herein, applications to other shared objects and other implementations will also be appreciated by persons of ordinary skill in the art. In addition, more complex shared object structures may be defined, which exploit the techniques described herein. Other synchronization constructs or primitives may be employed. For example, while many implementations have been described in the context of compare-and-swap (CAS) operations, based on that description, persons of ordinary skill in the art will appreciate suitable modifications to employ alternative constructs such as a load-linked, store-conditional (LL/SC) operation pair or transactional sequence or facilities of transactional memory, should such alternative constructs be available or desirable in another implementation or environment.

[0151] Plural instances may be provided for components, operations or structures described herein as a single instance. Finally, boundaries between various components, operations and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in the exemplary configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements may fall within the scope of the invention(s).

What is claimed is:
 1. A method of managing dynamically-allocated memory, the method comprising: providing a posting facility for pointers to respective blocks of the dynamically-allocated memory, wherein individual elements of the posting facility are associated with respective individual threads of a multithreaded computation; and based at least in part on a first request to qualify for deallocation those of the blocks referenced by a first set of pointers, determining for each of the posting facility elements whether then-current contents encode any pointer of the first set, and for each pointer of the first set not so encoded, qualifying the respective block for deallocation.
 2. The method of claim 1, wherein the posting facility is implemented as a set of posting locations.
 3. The method of claim 1, wherein the posting facility is implemented, at least in part, using a message passing system.
 4. The method of claim 1, wherein the determining and qualifying are performed in servicing the first request.
 5. The method of claim 1, wherein servicing of the first request hands off the first set of pointers to another thread for performance of the determining and qualifying.
 6. The method of claim 1, further comprising: announcing, via a respective element of the posting facility, intention to use a particular pointer value.
 7. The method of claim 1, further comprising: prior to the requesting, determining that the blocks referenced by the first set of pointers are no longer referenced via a data structure shared amongst threads of a multithreaded computation.
 8. The method of claim 1, wherein at least a particular one of the blocks referenced by a first set of pointers is qualified for deallocation.
 9. The method of claim 8, further comprising: reusing the particular block prior to any freeing thereof.
 10. The method of claim 8, further comprising: freeing the particular block.
 11. The method of claim 1, further comprising: for any particular pointer of the first set determined to be encoded by the then-current contents, handing off the particular pointer for deallocation qualification in response to a second request to qualify for deallocation.
 12. The method of claim 11, wherein the second request includes a retry of the first request.
 13. The method of claim 11, wherein the second request includes a request to qualify for deallocation those of the blocks referenced by a second set of pointers.
 14. The method of claim 13, wherein the first and second sets are distinct.
 15. The method of claim 13, wherein the first and second sets at least partially overlap.
 16. The method of claim 1, wherein the first set numbers at least one.
 17. The method of claim 11, wherein initiating of the second request precedes initiating of the first request.
 18. The method of claim 11, wherein initiating of the first request precedes initiating of the second request.
 19. The method of claim 11, wherein deallocation qualification in response to the first request and deallocation qualification in response to the second request are performed by separate threads.
 20. The method of claim 11, wherein deallocation qualification in response to the first request and deallocation qualification in response to the second request are performed by a same thread.
 21. The method of claim 2, further comprising: maintaining a set of handles each associated with a respective one of the posting locations, wherein exclusive ownership of a particular one of the posting locations is obtained by or for a particular one of the threads using a synchronization construct.
 22. The method of claim 21, wherein the synchronization construct includes a Compare-And-Swap (CAS) operation.
 23. The method of claim 21, wherein the synchronization construct includes a Load-Linked, Store-Conditional (LL/SC) operation pair.
 24. The method of claim 21, wherein the synchronization construct includes a transactional sequence.
 25. The method of claim 20, wherein execution of the transactional sequence is supported, at least in part, by hardware transactional memory.
 26. The method of claim 2, further comprising: maintaining a set of handoff locations for handed off pointers.
 27. The method of claim 26, wherein each of the handoff locations is associated with a respective one of the posting locations.
 28. The method of claim 26, wherein the sets of posting locations and handoff locations are organized as respective arrays.
 29. The method of claim 26, wherein the sets of posting locations and handoff locations are organized as respective lists.
 30. The method of claim 27, wherein displacement, by the handing off, of a pointer from a particular one of the handoff locations is sufficient to establish that the displaced pointer is not trapped by the associated posting location.
 31. The method of claim 26, wherein the handing off employs a synchronization construct for coordination amongst plural invocations of a request to qualify for deallocation.
 32. The method of claim 31, wherein the synchronization construct includes at least one of: a compare-and-swap (CAS) operation; a load-linked, store-conditional (LL/SC) operation pair; and a transactional sequence.
 33. The method of claim 1, wherein an individual thread of the multithreaded computation announces intent to use particular pointer values using respective elements of the posting facility exclusively owned by the individual thread.
 34. The method of claim 1, further comprising: responsive to the first request, qualifying for deallocation at least some additional block corresponding to a pointer not included in the first set thereof, wherein a pointer to the qualified additional block has been handed off for deallocation qualification by operation of another request.
 35. The method of claim 1, further comprising: further performing the determining in response to a second request subsequent to the first request.
 36. The method of claim 1, further comprising exclusively associating respective elements of the posting facility with individual threads of the multithreaded computation.
 37. The method of claim 36, wherein plural of the posting locations are associated with a particular one of the threads.
 38. The method of claim 36, wherein at least some of the associating is performed dynamically during the multithreaded computation.
 39. The method of claim 1, embodied as an application programming interface (API) that implements at least: a first routine that, when invoked, allows a first thread of the multithreaded computation to announce an intention to use a pointer value by posting to a respective element of the posting facility; and a second routine that, when invoked, allows a second thread of the multithreaded computation to initiate the deallocation qualification of memory blocks referenced by a set of pointers submitted thereto.
 40. The method of claim 39, wherein the first and second threads are different threads.
 41. The method of claim 39, wherein the first and second threads are a same thread.
 42. The method of claim 39, embodied as an application programming interface (API) that further implements: a third routine that, when invoked, allows the first thread of the multithreaded computation to cancel its prior announcement.
 43. The method of claim 39, embodied as an application programming interface (API) that further implements: one or more routines that, when invoked, manage exclusive ownership of individual elements of posting and handoff facilities by respective ones of the threads.
 44. The method of claim 1, embodied as part of a garbage collector implementation.
 45. The method of claim 1, wherein operation thereof is one or more of: wait-free; lock-free; and non-blocking.
 46. The method of claim 1, employed, at least in part, in management of storage for an implementation of a dynamically sizable data structure, wherein the data structure implementation is one or more of: wait-free; lock-free; and non-blocking.
 47. The method of claim 1, employed, at least in part, in management of storage for an implementation of a dynamically sizable data structure, wherein the data structure implementation is not necessarily non-blocking.
 48. A non-blocking memory management facility for managing dynamically allocated memory in a multithreaded computation, the non-blocking memory management facility comprising: posting locations; and a first functional sequence that, when initiated by a first computational thread that accesses memory, attempts to qualify for deallocation blocks of the memory referenced by a first set of pointers submitted thereto by determining for each of the posting locations whether then-current contents encode any pointer of the first set, and for each pointer of the first set not so encoded, signifying the qualification.
 49. The non-blocking memory management facility of claim 48, further comprising: handoff locations associated with respective ones of the posting locations.
 50. The non-blocking memory management facility of claim 49, wherein concurrent access to the handoff locations is mediated using a synchronization construct.
 51. The non-blocking memory management facility of claim 50, wherein the synchronization construct includes at least one of: a compare-and-swap (CAS) operation; a load-linked, store-conditional (LL/SC) operation pair; and a transactional sequence.
 52. The non-blocking memory management facility of claim 48, wherein, for any particular pointer of the first set determined to be encoded by the then-current contents of one or more of the posting locations, the first functional sequence hands off the particular pointer for deallocation qualification in response to a second request therefor.
 53. The non-blocking memory management facility of claim 52, wherein the hand off employs a synchronization construct for coordination amongst plural invocations of a request to qualify for deallocation.
 54. The non-blocking memory management facility of claim 48, further comprising: handles each associated with a respective one of the posting locations, wherein exclusive ownership of a particular one of the posting locations is obtained by or for a particular one of the threads using a synchronization construct with respect to a corresponding one of the handles.
 55. The non-blocking memory management facility of claim 48, further comprising: a second functional sequence that, when initiated by a second computational thread that accesses the memory, announces an intent to use a particular pointer value by storing the particular pointer value in one of the posting locations.
 56. The non-blocking memory management facility of claim 52, further comprising: handoff locations, wherein displacement, by the hand off, of a pointer from a particular one of the handoff locations is sufficient to establish that the displaced pointer is not trapped by the associated posting location.
 57. The non-blocking memory management facility of claim 48, wherein the multithreaded computation frees at least some of the blocks qualified for deallocation.
 58. The non-blocking memory management facility of claim 48, wherein the multithreaded computation subsequently reuses, prior to freeing, at least some of the blocks qualified for deallocation.
 59. The non-blocking memory management facility of claim 48, wherein, as part of at least some instances thereof, the first functional sequence qualifies for deallocation at least some additional block corresponding to a pointer not included in the first set thereof, wherein a pointer to the qualified additional block is handed off for deallocation qualification by operation of another execution thereof.
 60. The non-blocking memory management facility of claim 59, wherein the another execution is by either the first thread or another thread of the multithreaded computation.
 61. The non-blocking memory management facility of claim 48, embodied as an application programming interface (API) that implements one or more respective routines that, when invoked: allow the threads of the multithreaded computation to announce intent to use pointer values; and perform the deallocation qualification.
 62. The non-blocking memory management facility of claim 48, embodied as an application programming interface (API) that implements one or more respective routines that, when invoked: allow the threads of the multithreaded computation to post guards on pointer values; and liberate pointer values.
 63. The non-blocking memory management facility of claim 62, wherein the application programming interface (API) further implements one or more respective routines that, when invoked: dynamically establish exclusive ownership of the guards by respective ones of the threads.
 64. The non-blocking memory management facility of claim 48, embodied as part of a garbage collector implementation.
 65. The non-blocking memory management facility of claim 48, wherein operation thereof is one or more of: wait-free; lock-free; and non-blocking.
 66. The non-blocking memory management facility of claim 48, employed, at least in part, in management of storage for an implementation of a dynamically sizable data structure, wherein the data structure implementation is one or more of: wait-free; lock-free; and non-blocking.
 67. The non-blocking memory management facility of claim 48, employed, at least in part, in management of storage for an implementation of a dynamically sizable data structure, wherein the data structure implementation is not necessarily non-blocking.
 68. A computer program product encoded in one or more computer readable media and including a mechanism for management of dynamically allocated memory, the mechanism comprising: one or more data structures instantiable in memory to represent a set of posting locations; and a first instruction sequence that, when executed by a first computational thread that accesses the memory, attempts to qualify for deallocation blocks of the memory referenced by a first set of pointers by determining for each of the posting locations whether then-current contents encode any pointer of the first set, and for each pointer of the first set not so encoded, qualifying the respective block for deallocation.
 69. The computer program product of claim 68, further comprising: one or more data structures instantiable in memory to represent a set of handoff locations, wherein for any particular pointer of the first set determined to be encoded by the then-current contents, the first functional sequence hands off the particular pointer for deallocation qualification in response to another request therefor.
 70. The computer program product of claim 68, further comprising: a second instruction sequence that, when executed by a second computational thread that accesses the memory, announces an intent to use a particular pointer value by posting to a respective one of the posting locations.
 71. The computer program product of claim 68, embodied as a part of a memory allocator.
 72. The computer program product of claim 68, embodied at least in part as a garbage collection facility.
 73. The computer program product of claim 68, wherein the one or more computer readable media are selected from the set of a disk, tape or other magnetic, optical or electronic storage medium and a network, wireline, wireless or other communications medium.
 74. An apparatus comprising: a set of posting locations; and means for qualifying for deallocation blocks of memory referenced by a first set of pointers, the qualifying means including means for determining, for each of the posting locations, whether then-current contents encode any pointer of the first set, and for each pointer of the first set not so encoded, qualifying the respective block for deallocation.
 75. The apparatus of claim 74, further comprising: means for handing off a particular pointer determined to be encoded by the then-current contents, the handoff means employing a synchronization construct to hand off the particular pointer to another operation of the qualifying means.
 76. The apparatus of claim 74, further comprising: means for managing thread ownership of the posting locations. 
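
By way of illustration only, and without limiting any claim, the posting and deallocation-qualification routines recited above (e.g., in claims 39, 42 and 48) might be sketched in C as follows. The names post_guard, cancel_guard, qualify and NUM_POSTS are hypothetical and do not appear in the claims. The sketch is deliberately simplified: it assumes a fixed number of posting locations and omits the handoff locations, so a pointer found in a posting location is merely left with the caller for resubmission in a later request rather than handed off.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_POSTS 8 /* hypothetical fixed number of posting locations */

/* Posting locations: each holds a pointer value that some thread has
 * announced an intent to use, or NULL when nothing is announced. */
static _Atomic(void *) posts[NUM_POSTS];

/* Announce an intent to use ptr by posting it to location idx. */
static void post_guard(size_t idx, void *ptr) {
    assert(idx < NUM_POSTS);
    atomic_store(&posts[idx], ptr);
}

/* Cancel a prior announcement made through location idx. */
static void cancel_guard(size_t idx) {
    assert(idx < NUM_POSTS);
    atomic_store(&posts[idx], NULL);
}

/* Attempt to qualify for deallocation the blocks referenced by set[0..n).
 * Each posting location is read once; any pointer of the set found there
 * is marked in the caller-supplied scratch array trapped[0..n).
 * On return, the qualified pointers occupy set[0..q) and the pointers
 * still posted occupy set[q..n); q is the return value.
 * Simplification (assumption): no handoff is performed. */
static size_t qualify(void **set, size_t n, bool *trapped) {
    for (size_t i = 0; i < n; i++)
        trapped[i] = false;

    for (size_t p = 0; p < NUM_POSTS; p++) {
        void *v = atomic_load(&posts[p]); /* then-current contents */
        if (v == NULL)
            continue;
        for (size_t i = 0; i < n; i++)
            if (set[i] == v)
                trapped[i] = true;
    }

    size_t q = 0;
    for (size_t i = 0; i < n; i++) {
        if (!trapped[i]) {
            /* Move the qualified pointer to the front, keeping order. */
            void *tmp = set[q];
            set[q] = set[i];
            set[i] = tmp;
            q++;
        }
    }
    return q;
}
```

In such a sketch, a caller might pass the first q entries of the set to free (or reuse the blocks) and resubmit the remaining entries in a subsequent request. A fuller realization would instead hand those remaining pointers off through handoff locations using a synchronization construct such as CAS, as recited in the dependent claims.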